Abstract —Relational machine learning studies methods for the
statistical analysis of relational, or graph-structured, data. In this 
paper, we provide a review of how such statistical models can be 
“trained” on large knowledge graphs, and then used to predict 
new facts about the world (which is equivalent to predicting new 
edges in the graph). In particular, we discuss two fundamentally 
different kinds of statistical relational models, both of which can 
scale to massive datasets. The first is based on latent feature models
such as tensor factorization and multiway neural networks.
The second is based on mining observable patterns in the graph. 
We also show how to combine these latent and observable models 
to get improved modeling power at decreased computational cost. 
Finally, we discuss how such statistical models of graphs can be 
combined with text-based information extraction methods for 
automatically constructing knowledge graphs from the Web. To
this end, we also discuss Google’s Knowledge Vault project as an
example of such a combination.

Index Terms —Statistical Relational Learning, Knowledge 
Graphs, Knowledge Extraction, Latent Feature Models, Graph- 
based Models 


I. Introduction 

I am convinced that the crux of the problem of learning 
is recognizing relationships and being able to use them. 

Christopher Strachey in a letter to Alan Turing, 1954 

Traditional machine learning algorithms take as input
a feature vector, which represents an object in terms of 
numeric or categorical attributes. The main learning task is to 
learn a mapping from this feature vector to an output prediction 
of some form. This could be class labels, a regression score, 
or an unsupervised cluster id or latent vector (embedding). In 
Statistical Relational Learning (SRL), the representation of an 
object can contain its relationships to other objects. Thus the 
data is in the form of a graph , consisting of nodes (entities) 
and labelled edges (relationships between entities). The main 
goals of SRL include prediction of missing edges, prediction 
of properties of nodes, and clustering nodes based on their 
connectivity patterns. These tasks arise in many settings such 
as analysis of social networks and biological pathways. For 
further information on SRL see [1, 2, 3].

In this article, we review a variety of techniques from the 
SRL community and explain how they can be applied to 
large-scale knowledge graphs (KGs), i.e., graph structured 
knowledge bases (KBs) that store factual information in the
form of relationships between entities. Recently, a large
number of knowledge graphs have been created, including
YAGO [4], DBpedia [5], NELL [6], Freebase [7], and the
Google Knowledge Graph [8]. As we discuss in Section II,
these graphs contain millions of nodes and billions of edges.
This causes us to focus on scalable SRL techniques, which
take time that is (at most) linear in the size of the graph.

We can apply SRL methods to existing KGs to learn a 
model that can predict new facts (edges) given existing facts. 
We can then combine this approach with information extraction 
methods that extract “noisy” facts from the Web (see, e.g.,
[9, 10]). For example, suppose an information extraction method
returns a fact claiming that Barack Obama was born in Kenya, 
and suppose (for illustration purposes) that the true place of 
birth of Obama was not already stored in the knowledge graph. 
An SRL model can use related facts about Obama (such as his 
profession being US President) to infer that this new fact is 
unlikely to be true and should be discarded. This provides us 
a way to “grow” a KG automatically, as we explain in more
detail in Section IX.

The remainder of this paper is structured as follows. In
Section II we introduce knowledge graphs and some of their
properties. Section III discusses SRL and how it can be applied
to knowledge graphs. There are two main classes of SRL
techniques: those that capture the correlation between the
nodes/edges using latent variables, and those that capture
the correlation directly using statistical models based on the
observable properties of the graph. We discuss these two
families in Section IV and Section V, respectively. Section VI
describes methods for combining these two approaches, in
order to get the best of both worlds. Section VII discusses
how such models can be trained on KGs. In Section VIII we
discuss relational learning using Markov Random Fields. In
Section IX we describe how SRL can be used in automated
knowledge base construction projects. In Section X we discuss
extensions of the presented methods, and Section XI presents
our conclusions.


II. Knowledge Graphs 

In this section, we introduce knowledge graphs, and discuss 
how they are represented, constructed, and used. 

A. Knowledge representation 

Knowledge graphs model information in the form of entities
and relationships between them. This kind of relational
knowledge representation has a long history in logic and artificial
intelligence [11], for example, in semantic networks [12] and
frames [13]. More recently, it has been used in the Semantic
Web community with the purpose of creating a “web of data”
that is readable by machines [14]. While this vision of the
Semantic Web remains to be fully realized, parts of it have
been achieved. In particular, the concept of linked data [15, 16]
has gained traction, as it facilitates publishing and interlinking
data on the Web in relational form using the W3C Resource
Description Framework (RDF) [17, 18]. (For an introduction
to knowledge representation, see e.g. [11, 19, 20].)

Fig. 1. Sample knowledge graph. Nodes represent entities, edge labels represent
types of relations, edges represent existing relationships. (Nodes shown:
Leonard Nimoy, Spock, Star Trek, Science Fiction, Star Wars, Obi-Wan Kenobi,
Alec Guinness; edge labels shown: played, characterIn, genre.)

In this article, we will loosely follow the RDF standard and 
represent facts in the form of binary relationships, in particular 
(subject, predicate, object) (SPO) triples, where subject and 
object are entities and predicate is the relation between 
them. (We discuss how to represent higher-arity relations
in Section X-A.) The existence of a particular SPO triple
indicates an existing fact, i.e., that the respective entities are in
a relationship of the given type. For instance, the information

    Leonard Nimoy was an actor who played the character
    Spock in the science-fiction movie Star Trek

can be expressed via the following set of SPO triples:


    subject          predicate      object

    (LeonardNimoy,   profession,    Actor)
    (LeonardNimoy,   starredIn,     StarTrek)
    (LeonardNimoy,   played,        Spock)
    (Spock,          characterIn,   StarTrek)
    (StarTrek,       genre,         ScienceFiction)


We can combine all the SPO triples together to form a
multigraph, where nodes represent entities (all subjects and objects),
and directed edges represent relationships. The direction of an
edge indicates whether entities occur as subjects or objects, i.e.,
an edge points from the subject to the object. Different relations
are represented via different types of edges (also called edge
labels). This construction is called a knowledge graph (KG),
or sometimes a heterogeneous information network [21]. See
Figure 1 for an example.
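
As a minimal sketch of this construction, the SPO triples of the Leonard Nimoy example can be collected into a directed, edge-labelled multigraph. The function name `build_knowledge_graph` and the dictionary layout are assumptions made for illustration only; they are not part of RDF or of any particular KG toolkit.

```python
from collections import defaultdict

# SPO triples from the Leonard Nimoy example in the text.
triples = [
    ("LeonardNimoy", "profession", "Actor"),
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("LeonardNimoy", "played", "Spock"),
    ("Spock", "characterIn", "StarTrek"),
    ("StarTrek", "genre", "ScienceFiction"),
]

def build_knowledge_graph(spo_triples):
    """Return (nodes, out_edges) for a directed, edge-labelled multigraph.

    Hypothetical helper: nodes are all subjects and objects, and every
    triple becomes one labelled edge pointing from subject to object.
    """
    nodes = set()
    out_edges = defaultdict(list)  # subject -> [(predicate, object), ...]
    for s, p, o in spo_triples:
        nodes.update((s, o))
        out_edges[s].append((p, o))
    return nodes, dict(out_edges)

nodes, out_edges = build_knowledge_graph(triples)
```

Because edges are stored per subject together with their label, two entities can be connected by several edges of different types, which is what makes the structure a multigraph.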

In addition to being a collection of facts, knowledge graphs 
often provide type hierarchies (Leonard Nimoy is an actor, 
which is a person, which is a living thing) and type constraints 
(e.g., a person can only marry another person, not a thing). 


B. Open vs. closed world assumption

While existing triples always encode known true relationships
(facts), there are different paradigms for the interpretation of
non-existing triples:

• Under the closed world assumption (CWA), non-existing
triples indicate false relationships. For example, the fact
that in Figure 1 there is no starredIn edge from Leonard
Nimoy to Star Wars is interpreted to mean that Nimoy
definitely did not star in this movie.

• Under the open world assumption (OWA), a non-existing
triple is interpreted as unknown, i.e., the corresponding
relationship can be either true or false. Continuing with the
above example, the missing edge is not interpreted to mean
that Nimoy did not star in Star Wars. This more cautious
approach is justified, since KGs are known to be very
incomplete. For example, sometimes just the main actors
in a movie are listed, not the complete cast. As another
example, note that even the place of birth attribute, which
you might think would be typically known, is missing for
71% of all people included in Freebase [22].

RDF and the Semantic Web make the open-world assumption.
In Section VII-B we also discuss the local closed world
assumption (LCWA), which is often used for training relational
models.

TABLE I
Knowledge Base Construction Projects

Method                  Schema   Examples
Curated                 Yes      Cyc/OpenCyc [23], WordNet [24], UMLS [25]
Collaborative           Yes      Wikidata [26], Freebase [7]
Auto. Semi-Structured   Yes      YAGO [4, 27], DBPedia [5], Freebase [7]
Auto. Unstructured      Yes      Knowledge Vault [28], NELL [6], PATTY [29],
                                 PROSPERA [30], DeepDive/Elementary [31]
Auto. Unstructured      No       ReVerb [32], OLLIE [33], PRISMATIC [34]

C. Knowledge base construction

Completeness, accuracy, and data quality are important
parameters that determine the usefulness of knowledge bases
and are influenced by the way knowledge bases are constructed.
We can classify KB construction methods into four main
groups:

• In curated approaches, triples are created manually by a
closed group of experts.

• In collaborative approaches, triples are created manually
by an open group of volunteers.

• In automated semi-structured approaches, triples are
extracted automatically from semi-structured text (e.g.,
infoboxes in Wikipedia) via hand-crafted rules, learned
rules, or regular expressions.

• In automated unstructured approaches, triples are
extracted automatically from unstructured text via machine
learning and natural language processing techniques (see,
e.g., [9] for a review).


Construction of curated knowledge bases typically leads to
highly accurate results, but this technique does not scale well
due to its dependence on human experts. Collaborative
knowledge base construction, which was used to build Wikipedia
and Freebase, scales better but still has some limitations. For
instance, as mentioned previously, the place of birth attribute
is missing for 71% of all people included in Freebase, even
though this is a mandatory property of the schema [22]. Also,
a recent study [35] found that the growth of Wikipedia has
been slowing down. Consequently, automatic knowledge base
construction methods have been gaining more attention.

Such methods can be grouped into two main approaches. The
first approach exploits semi-structured data, such as Wikipedia
infoboxes, which has led to large, highly accurate knowledge
graphs such as YAGO [4, 27] and DBpedia [5]. The accuracy
(trustworthiness) of facts in such automatically created KGs is
often still very high. For instance, the accuracy of YAGO2 has
been estimated¹ to be over 95% through manual inspection
of sample facts [36], and the accuracy of Freebase [7] was
estimated to be 99%.² However, semi-structured text still covers
only a small fraction of the information stored on the Web, and
completeness (or coverage) is another important aspect of KGs.
Hence the second approach tries to “read the Web”, extracting
facts from the natural language text of Web pages. Example
projects in this category include NELL [6] and the Knowledge
Vault [28]. In Section IX we show how we can reduce the
level of “noise” in such automatically extracted facts by using
the knowledge from existing, high-quality repositories.

KGs, and more generally KBs, can also be classified based 
on whether they employ a fixed or open lexicon of entities and 
relations. In particular, we distinguish two main types of KBs: 

• In schema-based approaches, entities and relations are
represented via globally unique identifiers and all
possible relations are predefined in a fixed vocabulary. For
example, Freebase might represent the fact that Barack
Obama was born in Hawaii using the triple (/m/02mjmr,
/people/person/born-in, /m/03gh4), where /m/02mjmr is
the unique machine ID for Barack Obama.

• In schema-free approaches, entities and relations are
identified using open information extraction (OpenIE)
techniques [37], and represented via normalized but not
disambiguated strings (also referred to as surface names).
For example, an OpenIE system may contain triples such
as (“Obama”, “born in”, “Hawaii”), (“Barack Obama”,
“place of birth”, “Honolulu”), etc. Note that it is not clear
from this representation whether the first triple refers to
the same person as the second triple, nor whether “born
in” means the same thing as “place of birth”. This is the
main disadvantage of OpenIE systems.

1 For detailed statistics see http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/statistics/
2 http://thenoisychannel.com/2011/11/15/cikm-2011-industry-event-john-giannandrea-on-freebase-a-rosetta-stone-for-entities
3 Non-redundant triples, see [28, Table 1]
4 Last published numbers: https://tools.wmflabs.org/wikidata-todo/stats.php
and https://www.wikidata.org/wiki/Category:All_Properties
5 English content. Version 2014 from http://wiki.dbpedia.org/data-set-2014
6 See [27, Table 5]
7 Last published numbers: http://insidesearch.blogspot.de/2012/12/get-smarter-answers-from-knowledge_4.html


TABLE II
Size of some schema-based knowledge bases

                                      Number of
Knowledge Graph             Entities   Relation Types   Facts
Freebase³                   40 M       35,000           637 M
Wikidata⁴                   18 M       1,632            66 M
DBpedia (en)⁵               4.6 M      1,367            538 M
YAGO2⁶                      9.8 M      114              447 M
Google Knowledge Graph⁷     570 M      35,000           18,000 M


Table I lists current knowledge base construction projects
classified by their creation method and data schema. In this
paper, we will only focus on schema-based KBs. Table II shows
a selection of such KBs and their sizes.

D. Uses of knowledge graphs 

Knowledge graphs provide semantically structured information
that is interpretable by computers, a property that is
regarded as an important ingredient to build more intelligent
machines [38]. Consequently, knowledge graphs are already
powering multiple “Big Data” applications in a variety of
commercial and scientific domains. A prime example is the
integration of Google’s Knowledge Graph, which currently
stores 18 billion facts about 570 million entities, into the
results of Google’s search engine [8]. The Google Knowledge
Graph is used to identify and disambiguate entities in text, to
enrich search results with semantically structured summaries,
and to provide links to related entities in exploratory search.
(Microsoft has a similar KB, called Satori, integrated with its
Bing search engine [39].)

Enhancing search results with semantic information from 
knowledge graphs can be seen as an important step to transform 
text-based search engines into semantically aware question 
answering services. Another prominent example demonstrating 
the value of knowledge graphs is IBM’s question answering 
system Watson, which was able to beat human experts in the 
game of Jeopardy!. Among others, this system used YAGO,
DBpedia, and Freebase as sources of information [40].
Repositories of structured knowledge are also an indispensable 
component of digital assistants such as Siri, Cortana, or Google 
Now. 

Knowledge graphs are also used in several specialized 
domains. For instance, Bio2RDF [41], Neurocommons [42],
and LinkedLifeData [43] are knowledge graphs that integrate
multiple sources of biomedical information. These have been 
used for question answering and decision support in the life 
sciences. 

E. Main tasks in knowledge graph construction and curation 

In this section, we review a number of typical KG tasks. 

Link prediction is concerned with predicting the existence (or
probability of correctness) of (typed) edges in the graph (i.e.,
triples). This is important since existing knowledge graphs are
often missing many facts, and some of the edges they contain
are incorrect [44]. In the context of knowledge graphs, link
prediction is also referred to as knowledge graph completion.
For example, in Figure 1, suppose the characterIn edge from
Obi-Wan Kenobi to Star Wars were missing; we might be able
to predict this missing edge, based on the structural similarity
between this part of the graph and the part involving Spock
and Star Trek. It has been shown that relational models that
take the relationships of entities into account can significantly
outperform non-relational machine learning methods for this
task (e.g., see [45, 46]).

Entity resolution (also known as record linkage [47],
object identification [48], instance matching [49], and
de-duplication [50]) is the problem of identifying which objects
in relational data refer to the same underlying entities. See
Figure 2 for a small example. In a relational setting, the
decisions about which objects are assumed to be identical
can propagate through the graph, so that matching decisions
are made collectively for all objects in a domain rather
than independently for each object pair (see, for example,
[51, 52, 53]). In schema-based automated knowledge base
construction, entity resolution can be used to match the 
extracted surface names to entities stored in the knowledge 
graph. 

Link-based clustering extends feature-based clustering to a 
relational learning setting and groups entities in relational data 
based on their similarity. However, in link-based clustering, 
entities are not only grouped by the similarity of their features 
but also by the similarity of their links. As in entity resolution, 
the similarity of entities can propagate through the knowledge 
graph, such that relational modeling can add important
information for this task. In social network analysis, link-based
clustering is also known as community detection [54].

III. Statistical Relational Learning for 
Knowledge Graphs 

Statistical Relational Learning is concerned with the creation 
of statistical models for relational data. In the following sections 
we discuss how statistical relational learning can be applied 
to knowledge graphs. We will assume that all the entities
and (types of) relations in a knowledge graph are known. (We
discuss extensions of this assumption in Section X-C.) However,
triples are assumed to be incomplete and noisy; entities and
relation types may contain duplicates.

Notation: Before proceeding, let us define our mathematical
notation. (Variable names will be introduced later in the
appropriate sections.) We denote scalars by lower case letters,
such as $a$; column vectors (of size $N$) by bold lower case letters,
such as $\mathbf{a}$; matrices (of size $N_1 \times N_2$) by bold upper
case letters, such as $\mathbf{A}$; and tensors (of size
$N_1 \times N_2 \times N_3$) by bold upper case letters with an
underscore, such as $\underline{\mathbf{A}}$. We denote the
$k$'th “frontal slice” of a tensor $\underline{\mathbf{A}}$ by
$\mathbf{A}_k$ (which is a matrix of size $N_1 \times N_2$), and the
$(i,j,k)$'th element by $a_{ijk}$ (which is a scalar). We use
$[\mathbf{a}; \mathbf{b}]$ to denote the vertical stacking of vectors
$\mathbf{a}$ and $\mathbf{b}$, i.e.,
$[\mathbf{a}; \mathbf{b}] = \begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}$.
We can convert a matrix $\mathbf{A}$ of size $N_1 \times N_2$ into a
vector $\mathbf{a}$ of size $N_1 N_2$ by stacking all columns of
$\mathbf{A}$, denoted $\mathbf{a} = \mathrm{vec}(\mathbf{A})$. The inner
(scalar) product of two vectors (both of size $N$) is defined by
$\mathbf{a}^\top \mathbf{b} = \sum_{i=1}^{N} a_i b_i$. The tensor
(Kronecker) product of two vectors (of size $N_1$ and $N_2$) is a vector
of size $N_1 N_2$ with entries

$$\mathbf{a} \otimes \mathbf{b} = \begin{pmatrix} a_1 \mathbf{b} \\ \vdots \\ a_{N_1} \mathbf{b} \end{pmatrix}.$$

Matrix multiplication is denoted by $\mathbf{A}\mathbf{B}$ as usual. We
denote the $L_2$ norm of a vector by
$\|\mathbf{a}\|_2 = \sqrt{\sum_i a_i^2}$, and the Frobenius norm of a
matrix by $\|\mathbf{A}\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$. We denote
the vector of all ones by $\mathbf{1}$, and the identity matrix by
$\mathbf{I}$.

Fig. 3. Tensor representation of binary relational data.
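
The vector and matrix operations defined in the notation paragraph can be illustrated with a few lines of plain Python. This is a sketch for small inputs only; the helper names (`vstack`, `vec`, `inner`, `kron`, `l2_norm`, `frobenius_norm`) are chosen here for illustration, and matrices are assumed to be given as lists of rows.

```python
import math

def vstack(a, b):
    """[a; b]: vertical stacking of two vectors."""
    return list(a) + list(b)

def vec(A):
    """vec(A): stack the columns of A (a list of rows) into one vector."""
    n_rows, n_cols = len(A), len(A[0])
    return [A[i][j] for j in range(n_cols) for i in range(n_rows)]

def inner(a, b):
    """Inner product a^T b of two vectors of the same size."""
    return sum(x * y for x, y in zip(a, b))

def kron(a, b):
    """Kronecker product: the blocks a_1*b, ..., a_N1*b stacked into one vector."""
    return [x * y for x in a for y in b]

def l2_norm(a):
    """L2 norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for x in a))

def frobenius_norm(A):
    """Frobenius norm: square root of the sum of all squared matrix entries."""
    return math.sqrt(sum(x * x for row in A for x in row))

a, b = [1.0, 2.0], [3.0, 4.0, 5.0]
A = [[1.0, 2.0],
     [3.0, 4.0]]
```

For these toy inputs, `vec(A)` gives `[1.0, 3.0, 2.0, 4.0]` (columns first) and `kron(a, b)` has $N_1 N_2 = 6$ entries.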


A. Probabilistic knowledge graphs 

We now introduce some mathematical background so we can 
more formally define statistical models for knowledge graphs. 

Let $\mathcal{E} = \{e_1, \ldots, e_{N_e}\}$ be the set of all entities and
$\mathcal{R} = \{r_1, \ldots, r_{N_r}\}$ be the set of all relation types
in a knowledge graph. We model each possible triple
$x_{ijk} = (e_i, r_k, e_j)$ over this set of entities and relations as a
binary random variable $y_{ijk} \in \{0, 1\}$ that indicates its
existence. All possible triples in
$\mathcal{E} \times \mathcal{R} \times \mathcal{E}$ can be grouped
naturally in a third-order tensor (three-way array)
$\underline{\mathbf{Y}} \in \{0, 1\}^{N_e \times N_e \times N_r}$, whose
entries are set such that

$$y_{ijk} = \begin{cases} 1, & \text{if the triple } (e_i, r_k, e_j) \text{ exists} \\ 0, & \text{otherwise.} \end{cases}$$

We will refer to this construction as an adjacency tensor (cf.
Figure 3). Each possible realization of $\underline{\mathbf{Y}}$ can be
interpreted as a possible world. To derive a model for the entire
knowledge graph, we are then interested in estimating the joint
distribution $P(\underline{\mathbf{Y}})$, from a subset
$\mathcal{D} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \{0, 1\}$
of observed triples. In doing so, we are estimating a probability
distribution over possible worlds, which allows us to predict the
probability of triples based on the state of the entire knowledge graph.
While $y_{ijk} = 1$ in adjacency tensors indicates the existence of
a triple, the interpretation of $y_{ijk} = 0$ depends on whether the
open world, closed world, or local-closed world assumption is
made. For details, see Section VII-B.
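
The adjacency tensor can be sketched in code by storing only the index triples whose entry is 1. The entity and relation lists below reuse the running Star Trek example; the names `ent_idx`, `rel_idx`, `y`, and `frontal_slice` are hypothetical, and the set-based storage is one possible choice rather than a prescribed representation.

```python
entities = ["LeonardNimoy", "Spock", "StarTrek", "ScienceFiction"]
relations = ["played", "characterIn", "starredIn", "genre"]
observed = [
    ("LeonardNimoy", "played", "Spock"),
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("Spock", "characterIn", "StarTrek"),
    ("StarTrek", "genre", "ScienceFiction"),
]

ent_idx = {e: i for i, e in enumerate(entities)}
rel_idx = {r: k for k, r in enumerate(relations)}

# Sparse adjacency tensor: keep only the index triples (i, j, k) with y_ijk = 1.
ones = {(ent_idx[s], ent_idx[o], rel_idx[p]) for s, p, o in observed}

def y(i, j, k):
    """Entry y_ijk; a 0 only means 'no such triple was observed'."""
    return 1 if (i, j, k) in ones else 0

def frontal_slice(k):
    """Dense N_e x N_e slice for relation k (sensible for toy sizes only)."""
    n = len(entities)
    return [[y(i, j, k) for j in range(n)] for i in range(n)]
```

Whether a 0 entry is read as false or as unknown is not decided by this data structure; it depends on the world assumption chosen when the model is trained.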

Note that the size of $\underline{\mathbf{Y}}$ can be enormous for large
knowledge graphs. For instance, in the case of Freebase, which currently
consists of over 40 million entities and 35,000 relations, the
number of possible triples
$|\mathcal{E} \times \mathcal{R} \times \mathcal{E}|$ exceeds $10^{19}$
elements. Of course, type constraints reduce this number considerably.
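
The order of magnitude quoted above follows directly from the entity and relation counts. The short calculation below uses the rounded figures from the text and the Freebase fact count listed in Table II; the resulting density is therefore only a rough illustration, not a measured statistic.

```python
n_entities = 40_000_000    # "over 40 million entities" (from the text)
n_relations = 35_000       # "35,000 relations" (from the text)
n_observed = 637_000_000   # Freebase facts as listed in Table II

# |E x R x E| = N_e * N_r * N_e
n_possible_triples = n_entities * n_relations * n_entities

# Fraction of all possible triples that is actually stored.
density = n_observed / n_possible_triples

print(f"possible triples: {n_possible_triples:.1e}")
print(f"density: {density:.1e}")
```

With these inputs the number of possible triples is 5.6 × 10^19, so only about one in 10^11 cells of the adjacency tensor corresponds to a stored fact.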

Even amongst the syntactically valid triples, only a tiny
fraction are likely to be true. For example, there are over
450,000 actors and over 250,000 movies stored in
Freebase. But each actor stars only in a small number of movies.
Therefore, an important issue for SRL on knowledge graphs is
how to deal with the large number of possible relationships
while efficiently exploiting the sparsity of relationships. Ideally,
a relational model for large-scale knowledge graphs should
scale at most linearly with the data size, i.e., linearly in the
number of entities $N_e$, linearly in the number of relations $N_r$,
and linearly in the number of observed triples $|\mathcal{D}| = N_d$.

Fig. 2. Example of entity resolution in a toy knowledge graph. In this example, nodes 1 and 3 refer to the identical entity, the actor Alec Guinness. Node 2 on
the other hand refers to Arthur Guinness, the founder of the Guinness brewery. The surface name of node 2 (“A. Guinness”) alone would not be sufficient to
perform a correct matching as it could refer to both Alec Guinness and Arthur Guinness. However, since links in the graph reveal the occupations of the
persons, a relational approach can perform the correct matching. (Labels shown: Guinness, A. Guinness, Alec Guinness, Arthur Guinness, Beer, Star Wars,
Doctor Zhivago, The Bridge on the River Kwai; edge labels knownFor, type.)

B. Statistical properties of knowledge graphs 

Knowledge graphs typically adhere to some deterministic 
rules, such as type constraints and transitivity (e.g., if Leonard 
Nimoy was born in Boston, and Boston is located in the USA, 
then we can infer that Leonard Nimoy was born in the USA). 
However, KGs typically also exhibit various “softer” statistical
patterns or regularities, which are not universally true but
nevertheless have useful predictive power.

One example of such a statistical pattern is known as
homophily, that is, the tendency of entities to be related to
other entities with similar characteristics. This has been widely
observed in various social networks [55, 56]. For example,
US-born actors are more likely to star in US-made movies. For
multi-relational data (graphs with more than one kind of link),
homophily has also been referred to as autocorrelation [57].

Another statistical pattern is known as block structure. This 
refers to the property where entities can be divided into distinct 
groups (blocks), such that all the members of a group have 
similar relationships to members of other groups [58, 59, 60].
For example, we can group some actors, such as Leonard 
Nimoy and Alec Guinness, into a science fiction actor block, 
and some movies, such as Star Trek and Star Wars, into a 
science fiction movie block, since there is a high density of 
links from the scifi actor block to the scifi movie block. 

Graphs can also exhibit global and long-range statistical 
dependencies, i.e., dependencies that can span over chains of 
triples and involve different types of relations. For example, 
the citizenship of Leonard Nimoy (USA) depends statistically 
on the city where he was born (Boston), and this dependency 
involves a path over multiple entities (Leonard Nimoy, Boston, 
USA) and relations (bornIn, locatedIn, citizenOf). A distinctive
feature of relational learning is that it is able to exploit such 
patterns to create richer and more accurate models of relational 
domains. 
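
The path from the example (Leonard Nimoy, bornIn, Boston, locatedIn, USA) can be made concrete with a small lookup. This is a minimal sketch: `follow_path` is a hypothetical helper, and a statistical model would treat the existence of such a path as one signal among many rather than as a hard rule for citizenOf.

```python
from collections import defaultdict

# Observed triples from the running example; citizenOf is the fact to predict.
triples = [
    ("LeonardNimoy", "bornIn", "Boston"),
    ("Boston", "locatedIn", "USA"),
]

def follow_path(spo_triples, start, relation_path):
    """Entities reachable from `start` by following the relations in order."""
    index = defaultdict(set)  # (subject, predicate) -> {objects}
    for s, p, o in spo_triples:
        index[(s, p)].add(o)
    frontier = {start}
    for relation in relation_path:
        frontier = {o for s in frontier for o in index[(s, relation)]}
    return frontier

# Candidate objects for (LeonardNimoy, citizenOf, ?) suggested by the path.
candidates = follow_path(triples, "LeonardNimoy", ["bornIn", "locatedIn"])
```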

When applying statistical models to incomplete knowledge
graphs, it should be noted that the distribution of facts in such
KGs can be skewed. For instance, KGs that are derived from
Wikipedia will inherit the skew that exists in the distribution of
facts in Wikipedia itself.⁸ Statistical models as discussed in
the following sections can be affected by such biases in the
input data and need to be interpreted accordingly.


C. Types of SRL models 


As we discussed, the presence or absence of certain triples
in relational data is correlated with (i.e., predictive of) the
presence or absence of certain other triples. In other words,
the random variables $y_{ijk}$ are correlated with each other. We
will discuss three main ways to model these correlations:

M1) Assume all $y_{ijk}$ are conditionally independent given
latent features associated with subject, object and
relation type and additional parameters (latent feature
models)

M2) Assume all $y_{ijk}$ are conditionally independent given
observed graph features and additional parameters
(graph feature models)

M3) Assume all $y_{ijk}$ have local interactions (Markov
Random Fields)

In what follows we will mainly focus on M1 and M2 and their
combination; M3 will be the topic of Section VIII.


The model classes M1 and M2 predict the existence of a
triple $x_{ijk}$ via a score function $f(x_{ijk}; \Theta)$ which represents
the model’s confidence that a triple exists given the parameters
$\Theta$. The conditional independence assumptions of M1 and M2
allow the probability model to be written as follows:

$$P(\underline{\mathbf{Y}} \mid \mathcal{D}, \Theta) = \prod_{i=1}^{N_e} \prod_{j=1}^{N_e} \prod_{k=1}^{N_r} \mathrm{Ber}\big(y_{ijk} \mid \sigma(f(x_{ijk}; \Theta))\big) \qquad (1)$$

where $\sigma(u) = 1/(1 + e^{-u})$ is the sigmoid (logistic) function,
and

$$\mathrm{Ber}(y \mid p) = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases} \qquad (2)$$

is the Bernoulli distribution.
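
Equations (1) and (2) translate almost literally into code. The sketch below evaluates the logarithm of the product in Equation (1) over a handful of triples; the dictionaries `labels` and `scores`, keyed by index triples $(i, j, k)$, are hypothetical containers chosen for illustration, and the scores stand in for the output of some score function $f$.

```python
import math

def sigmoid(u):
    """Logistic function sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def bernoulli(y, p):
    """Ber(y | p): p if y == 1, and 1 - p if y == 0."""
    return p if y == 1 else 1.0 - p

def log_likelihood(labels, scores):
    """Sum of log Ber(y_ijk | sigma(f_ijk)) over the triples in `labels`.

    Taking the log turns the product of Equation (1) into a sum.
    """
    return sum(
        math.log(bernoulli(labels[t], sigmoid(scores[t])))
        for t in labels
    )

# Toy data: one existing triple with a high score, one absent triple
# with a low score (all values are made up for this example).
labels = {(0, 1, 0): 1, (1, 0, 0): 0}
scores = {(0, 1, 0): 2.0, (1, 0, 0): -2.0}
ll = log_likelihood(labels, scores)
```

A model whose scores agree with the labels, as in this toy case, yields a log-likelihood close to zero; scores with the wrong sign push it towards large negative values.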

We will refer to models of the form of Equation (1) as
probabilistic models. In addition to probabilistic models, we
will also discuss models which optimize $f(\cdot)$ under other
criteria, for instance models which maximize the margin
between existing and non-existing triples. We will refer to
such models as score-based models. If desired, we can derive
probabilities for score-based models via Platt scaling [61].

8 As an example, there are currently 10,306 male and 7,586 female American
actors listed in Wikipedia, while there are only 1,268 male and 1,354 female
Indian, and 77 male and no female Nigerian actors. India and Nigeria, however,
are the largest and second largest film industries in the world.

There are many different methods for defining $f(\cdot)$. In
Sections IV to VI and VIII we will discuss different
options for all model classes. In Section VII we will furthermore
discuss aspects of how to train these models on knowledge
graphs.


IV. Latent Feature Models 


In this section, we assume that the variables $y_{ijk}$ are
conditionally independent given a set of global latent features
and parameters, as in Equation (1). We discuss various possible
forms for the score function $f(x_{ijk}; \Theta)$ below. What all models
have in common is that they explain triples via latent features
of entities. (This is justified via various theoretical arguments
[62].) For instance, a possible explanation for the fact that Alec
Guinness received the Academy Award is that he is a good 
actor. This explanation uses latent features of entities (being a 
good actor) to explain observable facts (Guinness receiving the 
Academy Award). We call these features “latent” because they 
are not directly observed in the data. One task of all latent 
feature models is therefore to infer these features automatically 
from the data. 

In the following, we will denote the latent feature
representation of an entity $e_i$ by the vector
$\mathbf{e}_i \in \mathbb{R}^{H_e}$, where $H_e$ denotes
the number of latent features in the model. For instance, we
could model that Alec Guinness is a good actor and that the
Academy Award is a prestigious award via the vectors

$$\mathbf{e}_{\text{Guinness}} = \begin{bmatrix} 0.9 \\ 0.2 \end{bmatrix}, \qquad \mathbf{e}_{\text{AcademyAward}} = \begin{bmatrix} 0.2 \\ 0.8 \end{bmatrix}$$

where the component $e_{i1}$ corresponds to the latent feature
Good Actor and $e_{i2}$ corresponds to Prestigious Award. (Note
that, unlike this example, the latent features that are inferred
by the following models are typically hard to interpret.)

The key intuition behind relational latent feature models 
is that the relationships between entities can be derived from 
interactions of their latent features. However, there are many 
possible ways to model these interactions, and many ways to 
derive the existence of a relationship from them. We discuss 
several possibilities below. See Table III for a summary of the 
notation. 


A. RESCAL: A bilinear model 

RESCAL ll63l [64; , 651 is a relational latent feature model 
which explains triples via pairwise interactions of latent features. 
In particular, we model the score of a triple Xijk as 

fifk SCAL '■= e 7 w fc e j = 2 £ ™ abk e ia e jb (3) 

a = 16=1 

where Wfc e WL Hc><He is a weight matrix whose entries w a bk 
specify how much the latent features a and b interact in the 
/c-th relation. We call this a bilinear model, since it captures the 
interactions between the two entity vectors using multiplicative 
terms. For instance, we could model the pattern that good
actors are likely to receive prestigious awards via a weight
matrix such as

$$\mathbf{W}_{\text{receivedAward}} = \begin{bmatrix} 0.1 & 0.9 \\ 0.1 & 0.1 \end{bmatrix}.$$

TABLE III
Summary of the notation.

Relational data

| Symbol | Meaning |
|---|---|
| $N_e$ | Number of entities |
| $N_r$ | Number of relations |
| $N_d$ | Number of training examples |
| $e_i$ | $i$-th entity in the dataset (e.g., LeonardNimoy) |
| $r_k$ | $k$-th relation in the dataset (e.g., bornIn) |
| $\mathcal{D}^+$ | Set of observed positive triples |
| $\mathcal{D}^-$ | Set of observed negative triples |

Probabilistic Knowledge Graphs

| Symbol | Meaning | Size |
|---|---|---|
| $\mathbf{Y}$ | (Partially observed) labels for all triples | $N_e \times N_e \times N_r$ |
| $\mathbf{F}$ | Score for all possible triples | $N_e \times N_e \times N_r$ |
| $\mathbf{Y}_k$ | Slice of $\mathbf{Y}$ for relation $r_k$ | $N_e \times N_e$ |
| $\mathbf{F}_k$ | Slice of $\mathbf{F}$ for relation $r_k$ | $N_e \times N_e$ |

Graph and Latent Feature Models

| Symbol | Meaning |
|---|---|
| $\boldsymbol{\phi}_{ijk}$ | Feature vector representation of triple $(e_i, r_k, e_j)$ |
| $\mathbf{w}_k$ | Weight vector to derive scores for relation $k$ |
| $\Theta$ | Set of all parameters of the model |
| $\sigma(\cdot)$ | Sigmoid (logistic) function |

Latent Feature Models

| Symbol | Meaning | Size |
|---|---|---|
| $H_e$ | Number of latent features for entities | |
| $H_r$ | Number of latent features for relations | |
| $\mathbf{e}_i$ | Latent feature repr. of entity $e_i$ | $H_e$ |
| $\mathbf{r}_k$ | Latent feature repr. of relation $r_k$ | $H_r$ |
| $H_a$ | Size of $\mathbf{h}^a$ layer | |
| $H_b$ | Size of $\mathbf{h}^b$ layer | |
| $H_c$ | Size of $\mathbf{h}^c$ layer | |
| $\mathbf{E}$ | Entity embedding matrix | $N_e \times H_e$ |
| $\mathbf{W}_k$ | Bilinear weight matrix for relation $k$ | $H_e \times H_e$ |
| $\mathbf{A}_k$ | Linear feature map for pairs of entities for relation $r_k$ | $(2H_e) \times H_a$ |
| $\mathbf{C}$ | Linear feature map for triples | $(2H_e + H_r) \times H_c$ |


In general, we can model block structure patterns via the
magnitude of entries in $\mathbf{W}_k$, while we can model homophily
patterns via the magnitude of its diagonal entries.
Anti-correlations in these patterns can be modeled via negative
entries in $\mathbf{W}_k$.

Hence, in Equation (3) we compute the score of a triple
$x_{ijk}$ via the weighted sum of all pairwise interactions between
the latent features of the entities $e_i$ and $e_j$. The parameters of
the model are $\Theta = \{\{\mathbf{e}_i\}_{i=1}^{N_e}, \{\mathbf{W}_k\}_{k=1}^{N_r}\}$. During training we
jointly learn the latent representations of entities and how the
latent features interact for particular relation types.
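As a concrete check of Equation (3), the following Python sketch evaluates the bilinear score as an explicit double sum. The weight matrix is the receivedAward example from the text; the two entity vectors and the function name `rescal_score` are made-up illustrations, not learned parameters or part of any library.

```python
def rescal_score(e_i, W_k, e_j):
    """Bilinear score e_i^T W_k e_j, written as the double sum of Equation (3)."""
    return sum(
        W_k[a][b] * e_i[a] * e_j[b]
        for a in range(len(e_i))
        for b in range(len(e_j))
    )

# Hypothetical latent features, ordered as [good actor, prestigious award].
e_actor = [0.9, 0.2]
e_award = [0.2, 0.8]

# The large entry in row 0, column 1 couples "good actor" subjects
# with "prestigious award" objects.
W_received_award = [[0.1, 0.9],
                    [0.1, 0.1]]

score_forward = rescal_score(e_actor, W_received_award, e_award)
score_reverse = rescal_score(e_award, W_received_award, e_actor)
```

Because $\mathbf{W}_k$ is not symmetric, the score of (actor, receivedAward, award) is much larger than the score of the reversed triple, which is how the model represents the direction of a relation.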

In the following, we will discuss further important properties 
of the model for learning from knowledge graphs. 

Relational learning via shared representations: In Equation (3),
entities have the same latent representation regardless
of whether they occur as subjects or objects in a relationship.
Furthermore, they have the same representation over all
different relation types. For instance, the $i$-th entity occurs
in the triple $x_{ijk}$ as the subject of a relationship of type $k$,
while it occurs in the triple $x_{piq}$ as the object of a relationship
of type $q$. However, the predictions $f_{ijk} = \mathbf{e}_i^\top \mathbf{W}_k \mathbf{e}_j$ and
$f_{piq} = \mathbf{e}_p^\top \mathbf{W}_q \mathbf{e}_i$ both use the same latent representation $\mathbf{e}_i$
of the $i$-th entity. Since all parameters are learned jointly,
these shared representations allow information to propagate
between triples via the latent representations of entities and the
weights of relations. This allows the model to capture global
dependencies in the data.

Fig. 4. RESCAL as a tensor factorization of the adjacency tensor Y.

Semantic embeddings: The shared entity representations
in RESCAL also capture the similarity of entities in the
relational domain, i.e., that entities are similar if they are
connected to similar entities via similar relations [65]. For
instance, if the representations of $e_i$ and $e_p$ are similar, the
predictions $f_{ijk}$ and $f_{pjk}$ will have similar values. In return,
entities with many similar observed relationships will have
similar latent representations. This property can be exploited for
entity resolution and has also enabled large-scale hierarchical
clustering on relational data [63], [64]. Moreover, since relational
similarity is expressed via the similarity of vectors, the latent
representations $\mathbf{e}_i$ can act as proxies to give non-relational
machine learning algorithms such as $k$-means or kernel methods
access to the relational similarity of entities.

Connection to tensor factorization: RESCAL is similar
to methods used in recommendation systems [66], and to
traditional tensor factorization methods [67]. In matrix notation,
Equation (3) can be written compactly as $\mathbf{F}_k = \mathbf{E} \mathbf{W}_k \mathbf{E}^\top$,
where $\mathbf{F}_k \in \mathbb{R}^{N_e \times N_e}$ is the matrix holding all scores for the
$k$-th relation and the $i$-th row of $\mathbf{E} \in \mathbb{R}^{N_e \times H_e}$ holds the latent
representation of $e_i$. See Figure 4 for an illustration. In the
following, we will use this tensor representation to derive a
very efficient algorithm for parameter estimation.
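The matrix form $\mathbf{F}_k = \mathbf{E}\mathbf{W}_k\mathbf{E}^\top$ can be illustrated with a small plain-Python sketch. The embedding matrix below is a made-up example with three entities and two latent features, and `matmul` and `transpose` are hypothetical helper names written here only for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose of a matrix stored as nested lists."""
    return [list(col) for col in zip(*A)]

# Hypothetical entity embedding matrix E (N_e = 3 rows, H_e = 2 columns);
# row i holds the latent representation of entity e_i.
E = [[0.9, 0.2],
     [0.2, 0.8],
     [0.7, 0.1]]

# Weight matrix for one relation k (the receivedAward example).
W_k = [[0.1, 0.9],
       [0.1, 0.1]]

# F_k[i][j] is the score of the triple (e_i, r_k, e_j) for every entity pair.
F_k = matmul(matmul(E, W_k), transpose(E))
```

Each entry `F_k[i][j]` equals the bilinear score of Equation (3) for the corresponding pair, so one relation slice of the score tensor is obtained from two matrix products.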

Fitting the model: If we want to compute a probabilistic
model, the parameters of RESCAL can be estimated by
minimizing the log-loss using gradient-based methods such as
stochastic gradient descent [68]. RESCAL can also be computed
as a score-based model, which has the main advantage
that we can estimate the parameters $\Theta$ very efficiently: Due
to its tensor structure and due to the sparsity of the data, it
has been shown that the RESCAL model can be computed
via a sequence of efficient closed-form updates when using
the squared-loss [63], [64]. In this setting, it has been shown
analytically that a single update of $\mathbf{E}$ and $\mathbf{W}_k$ scales linearly
with the number of entities $N_e$, linearly with the number of
relations $N_r$, and linearly with the number of observed triples,
i.e., the number of non-zero entries in $\mathbf{Y}$ [64]. We call this
algorithm RESCAL-ALS.⁹ In practice, a small number (say 30
to 50) of iterated updates is often sufficient for RESCAL-ALS
to arrive at stable estimates of the parameters. Given a current
estimate of $\mathbf{E}$, the updates for each $\mathbf{W}_k$ can be computed in
parallel to improve the scalability on knowledge graphs with
a large number of relations. Furthermore, by exploiting the
special tensor structure of RESCAL, we can derive improved
updates for RESCAL-ALS that compute the estimates for the
parameters with a runtime complexity of $O(H_e^3)$ for a single
update (as opposed to a runtime complexity of $O(H_e^5)$ for
naive updates) [65], [69]. In summary, for relational domains
that can be explained via a moderate number of latent features,
RESCAL-ALS is highly scalable and very fast to compute.
For more detail on RESCAL-ALS see also Equation (26) in
Section VII.

Decoupled Prediction: In Equation (3), the probability
of a single relationship is computed via simple matrix-vector
products in $O(H_e^2)$ time. Hence, once the parameters have been
estimated, the computational complexity to predict the score of
a triple depends only on the number of latent features and is
independent of the size of the graph. However, during parameter
estimation, the model can capture global dependencies due to
the shared latent representations.

Relational learning results: RESCAL has been shown
to achieve state-of-the-art results on a number of relational
learning tasks. For instance, [63] showed that RESCAL
provides comparable or better relationship prediction results
on a number of small benchmark datasets compared to
Markov Logic Networks (with structure learning) [70], the
Infinite (Hidden) Relational model [71], [72], and Bayesian
Clustered Tensor Factorization [73]. Moreover, RESCAL has
been used for link prediction on entire knowledge graphs such
as YAGO and DBpedia [64], [74]. Aside from link prediction,
RESCAL has also successfully been applied to SRL tasks such
as entity resolution and link-based clustering. For instance,
RESCAL has shown state-of-the-art results in predicting which
authors, publications, or publication venues are likely to be
identical in publication databases [?], [65]. Furthermore, the
semantic embedding of entities computed by RESCAL has
been exploited to create taxonomies for uncategorized data via
hierarchical clusterings of entities in the embedding space [75].

B. Other tensor factorization models 

Various other tensor factorization methods have been explored
for learning from knowledge graphs and multi-relational
data. [76], [77] factorized adjacency tensors using the CP
tensor decomposition to analyze the link structure of Web
pages and Semantic Web data, respectively. [78] applied
pairwise interaction tensor factorization [79] to predict triples
in knowledge graphs. [80] applied factorization machines to
large uni-relational datasets in recommendation settings. [81]
proposed a tensor factorization model for knowledge graphs
with a very large number of different relations.

It is also possible to use discrete latent factors. [82] proposed
Boolean tensor factorization to disambiguate facts extracted
with OpenIE methods and applied it to large datasets [83]. In
contrast to previously discussed factorizations, Boolean tensor
factorizations are discrete models, where adjacency tensors are
decomposed into binary factors based on Boolean algebra.

⁹ ALS stands for Alternating Least-Squares.


C. Matrix factorization methods 

Another approach for learning from knowledge graphs is
based on matrix factorization, where, prior to the factorization,
the adjacency tensor $\mathbf{Y} \in \mathbb{R}^{N_e \times N_e \times N_r}$ is reshaped into a matrix
$\mathbf{Y} \in \mathbb{R}^{N_e^2 \times N_r}$ by associating rows with subject-object pairs
$(e_i, e_j)$ and columns with relations $r_k$ (cf. [84], [85]), or into
a matrix $\mathbf{Y} \in \mathbb{R}^{N_e \times N_e N_r}$ by associating rows with subjects
$e_i$ and columns with relation/objects $(r_k, e_j)$ (cf. [86], [87]).
Unfortunately, both of these formulations lose information
compared to tensor factorization. For instance, if each
subject-object pair is modeled via a different latent representation, the
information that the relationships $y_{ijk}$ and $y_{pjq}$ share the same
object is lost. It also leads to an increased memory complexity,
since a separate latent representation is computed for each pair
of entities, requiring $O(N_e^2 H_e + N_r H_e)$ parameters (compared
to $O(N_e H_e + N_r H_e^2)$ parameters for RESCAL).


D. Multi-layer perceptrons 

We can interpret RESCAL as creating composite representations
of triples and predicting their existence from this
representation. In particular, we can rewrite RESCAL as

$$f_{ijk}^{\text{RESCAL}} := \mathbf{w}_k^\top \boldsymbol{\phi}_{ij}^{\text{RESCAL}} \qquad (4)$$
$$\boldsymbol{\phi}_{ij}^{\text{RESCAL}} := \mathbf{e}_j \otimes \mathbf{e}_i, \qquad (5)$$

where $\mathbf{w}_k := \operatorname{vec}(\mathbf{W}_k)$. Equation (4) follows from Equation (3)
via the equality $\operatorname{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^\top \otimes \mathbf{A}) \operatorname{vec}(\mathbf{X})$. Hence,
RESCAL represents pairs of entities $(e_i, e_j)$ via the tensor
product of their latent feature representations (Equation (5))
and predicts the existence of the triple $x_{ijk}$ from $\boldsymbol{\phi}_{ij}$ via $\mathbf{w}_k$
(Equation (4)). See also Figure 5a. For a further discussion of
the tensor product to create composite latent representations
please see [88], [89], [90].

Since the tensor product explicitly models all pairwise
interactions, RESCAL can require a lot of parameters when
the number of latent features is large (each matrix $\mathbf{W}_k$ has
$H_e^2$ entries). This can, for instance, lead to scalability problems
on knowledge graphs with a large number of relations.
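The equivalence between the bilinear form of Equation (3) and the composite-feature form of Equations (4) and (5) can be checked numerically. The sketch below assumes the usual column-stacking convention for $\operatorname{vec}(\cdot)$; the helper names `vec`, `kron`, and `dot` and the example vectors are illustrative assumptions.

```python
def vec(W):
    """Stack the columns of W into one list (column-major vectorization)."""
    rows, cols = len(W), len(W[0])
    return [W[a][b] for b in range(cols) for a in range(rows)]

def kron(u, v):
    """Kronecker (tensor) product of two vectors."""
    return [ui * vi for ui in u for vi in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up latent features and the receivedAward weight matrix.
e_i = [0.9, 0.2]
e_j = [0.2, 0.8]
W_k = [[0.1, 0.9],
       [0.1, 0.1]]

# Equation (3): e_i^T W_k e_j as a double sum.
bilinear = sum(W_k[a][b] * e_i[a] * e_j[b] for a in range(2) for b in range(2))

# Equations (4)-(5): w_k^T (e_j kron e_i) with w_k = vec(W_k).
composite = dot(vec(W_k), kron(e_j, e_i))
```

Both expressions give the same score. The composite feature vector has $H_e^2$ entries, which is the source of the parameter growth discussed above.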

In the following we will discuss models based on multi-layer
perceptrons (MLPs), also known as feedforward neural
networks. In the context of multidimensional data they can
be referred to as multiway neural networks. This approach
allows us to consider alternative ways to create composite
triple representations and to use nonlinear functions to predict
their existence.

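In particular, let us define the following E-MLP model (E
for entity):

$$f_{ijk}^{\text{E-MLP}} := \mathbf{w}_k^\top \mathbf{g}(\mathbf{h}_{ijk}^a) \qquad (6)$$
$$\mathbf{h}_{ijk}^a := \mathbf{A}_k^\top \boldsymbol{\phi}_{ij}^{\text{E-MLP}} \qquad (7)$$
$$\boldsymbol{\phi}_{ij}^{\text{E-MLP}} := [\mathbf{e}_i; \mathbf{e}_j] \qquad (8)$$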

TABLE IV
Semantic Embeddings of KV-MLP on Freebase

| Relation | Nearest Neighbors |
|---|---|
| children | parents (0.4), spouse (0.5), birth-place (0.8) |
| birth-date | children (1.24), gender (1.25), parents (1.29) |
| edu-end¹⁰ | job-start (1.41), edu-start (1.61), job-end (1.74) |


where $\mathbf{g}(\mathbf{u}) = [g(u_1), g(u_2), \ldots]$ is the function $g$ applied
element-wise to vector $\mathbf{u}$; one often uses the nonlinear function
$g(u) = \tanh(u)$.

Here $\mathbf{h}^a$ is an additive hidden layer, which is derived
by adding together different weighted components of the
entity representations. In particular, we create a composite
representation $\boldsymbol{\phi}_{ij}^{\text{E-MLP}} = [\mathbf{e}_i; \mathbf{e}_j] \in \mathbb{R}^{2H_e}$ via the concatenation
of $\mathbf{e}_i$ and $\mathbf{e}_j$. However, concatenation alone does not consider
any interactions between the latent features of $e_i$ and $e_j$.
For this reason, we add a (vector-valued) hidden layer $\mathbf{h}^a$
of size $H_a$, from which the final prediction is derived via
$\mathbf{w}_k^\top \mathbf{g}(\mathbf{h}^a)$. The important difference to tensor-product models
like RESCAL is that we learn the interactions of latent
features via the matrix $\mathbf{A}_k$ (Equation (7)), while the tensor
product always considers all possible interactions between
latent features. This adaptive approach can reduce the number
of required parameters significantly, especially on datasets with
a large number of relations.
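A forward pass through the E-MLP of Equations (6) to (8) can be sketched as follows. All parameter values are arbitrary illustrations rather than trained weights, and `e_mlp_score` is a hypothetical helper name; $\mathbf{A}_k$ is stored as a $(2H_e) \times H_a$ nested list, matching the size given in Table III.

```python
import math

def e_mlp_score(e_i, e_j, A_k, w_k):
    """Score w_k^T g(A_k^T [e_i; e_j]) with g = tanh applied element-wise."""
    phi = e_i + e_j  # list concatenation gives [e_i; e_j] of length 2 * H_e
    # Additive hidden layer h^a = A_k^T phi: one weighted sum per hidden unit.
    h_a = [
        sum(A_k[r][c] * phi[r] for r in range(len(phi)))
        for c in range(len(A_k[0]))
    ]
    return sum(w * math.tanh(h) for w, h in zip(w_k, h_a))

# Illustrative values with H_e = 2 and H_a = 2.
e_i = [1.0, 0.0]
e_j = [0.0, 1.0]
A_k = [[0.5, 0.0],
       [0.0, 0.5],
       [0.0, 0.5],
       [0.5, -0.5]]
w_k = [1.0, 2.0]

score = e_mlp_score(e_i, e_j, A_k, w_k)
```

In this example the hidden layer evaluates to $[1.0, -0.5]$, so the score is $\tanh(1.0) + 2\tanh(-0.5)$. Unlike the tensor product, only the $H_a$ combinations encoded in $\mathbf{A}_k$ enter the prediction.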

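One disadvantage of the E-MLP is that it has to define
a vector $\mathbf{w}_k$ and a matrix $\mathbf{A}_k$ for every possible relation,
which requires $H_a + (H_a \times 2H_e)$ parameters per relation.
An alternative is to embed the relation itself, using an
$H_r$-dimensional vector $\mathbf{r}_k$. We can then define

$$f_{ijk}^{\text{ER-MLP}} := \mathbf{w}^\top \mathbf{g}(\mathbf{h}_{ijk}^c) \qquad (9)$$
$$\mathbf{h}_{ijk}^c := \mathbf{C}^\top \boldsymbol{\phi}_{ijk}^{\text{ER-MLP}} \qquad (10)$$
$$\boldsymbol{\phi}_{ijk}^{\text{ER-MLP}} := [\mathbf{e}_i; \mathbf{e}_j; \mathbf{r}_k]. \qquad (11)$$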

We call this model the ER-MLP, since it applies an MLP to
an embedding of the entities and relations. Please note that
ER-MLP uses a global weight vector for all relations. This
model was used in the KV project (see Section IX), since it
has many fewer parameters than the E-MLP (see Table V); the
reason is that $\mathbf{C}$ is independent of the relation $k$.
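To make the difference in model size concrete, the parameter counts listed in Table V can be evaluated for one set of dimensions. The sizes below are hypothetical and chosen only for illustration; they are not taken from the KV project or from any dataset in the text.

```python
def rescal_params(N_e, N_r, H_e):
    # One H_e x H_e matrix per relation plus the entity embeddings.
    return N_r * H_e ** 2 + N_e * H_e

def e_mlp_params(N_e, N_r, H_e, H_a):
    # Per relation: w_k (H_a entries) and A_k (H_a x 2 H_e entries).
    return N_r * (H_a + H_a * 2 * H_e) + N_e * H_e

def er_mlp_params(N_e, N_r, H_e, H_r, H_c):
    # Global w and C, plus relation and entity embeddings.
    return H_c + H_c * (2 * H_e + H_r) + N_r * H_r + N_e * H_e

# Hypothetical sizes.
N_e, N_r = 10_000, 1_000
H_e, H_r, H_a, H_c = 60, 60, 10, 10

counts = {
    "RESCAL": rescal_params(N_e, N_r, H_e),
    "E-MLP": e_mlp_params(N_e, N_r, H_e, H_a),
    "ER-MLP": er_mlp_params(N_e, N_r, H_e, H_r, H_c),
}
```

With these sizes the counts are 4,200,000 for RESCAL, 1,810,000 for E-MLP, and 661,810 for ER-MLP. The E-MLP saving depends on choosing $H_a$ well below $H_e / 2$, whereas the ER-MLP cost per relation is only the $H_r$ entries of its embedding.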

It has been shown in [91] that MLPs can learn to put
“semantically similar” words close by in the embedding space,
even if they are not explicitly trained to do so. In [28], they show
a similar result for the semantic embedding of relations using
ER-MLP. For example, Table IV shows the nearest neighbors
of latent representations of selected relations that have been
computed with a 60-dimensional model on Freebase. Numbers
in parentheses represent squared Euclidean distances. It can
be seen that ER-MLP puts semantically related relations near
each other. For instance, the closest relations to the children
relation are parents, spouse, and birth-place.


¹⁰ The relations edu-start, edu-end, job-start, job-end represent the start and
end dates of attending an educational institution and holding a particular job,
respectively.

Fig. 5. Visualization of RESCAL and the ER-MLP model as Neural Networks. Here, $H_e = H_r = 3$ and $H_a = 3$. Note that the inputs are latent features.
The symbol $g$ denotes the application of the function $g(\cdot)$.


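TABLE V
Summary of the latent feature models. $\mathbf{h}^a$, $\mathbf{h}^b$ and $\mathbf{h}^c$ are hidden layers of the neural network; see text for details.

| Method | $f_{ijk}$ | $\mathbf{A}_k$ | $\mathbf{C}$ | $\mathbf{B}_k$ | Num. Parameters |
|---|---|---|---|---|---|
| RESCAL [64] | $\mathbf{w}_k^\top \mathbf{h}_{ijk}^b$ | - | - | $[\delta_{1,1}, \ldots, \delta_{H_e,H_e}]$ | $N_r H_e^2 + N_e H_e$ |
| E-MLP [?] | $\mathbf{w}_k^\top \mathbf{g}(\mathbf{h}_{ijk}^a)$ | $[\mathbf{A}_k^s; \mathbf{A}_k^o]$ | - | - | $N_r (H_a + H_a \times 2H_e) + N_e H_e$ |
| ER-MLP [28] | $\mathbf{w}^\top \mathbf{g}(\mathbf{h}_{ijk}^c)$ | - | $\mathbf{C}$ | - | $H_c + H_c \times (2H_e + H_r) + N_r H_r + N_e H_e$ |
| NTN [92] | $\mathbf{w}_k^\top \mathbf{g}([\mathbf{h}_{ijk}^a; \mathbf{h}_{ijk}^b])$ | $[\mathbf{A}_k^s; \mathbf{A}_k^o]$ | - | $[\mathbf{B}_k^1, \ldots, \mathbf{B}_k^{H_b}]$ | $N_e^2 H_b + N_r (H_b + H_a) + 2 N_r H_e H_a + N_e H_e$ |
| Structured Embeddings [93] | $-\lVert \mathbf{h}_{ijk}^a \rVert_1$ | $[\mathbf{A}_k^s; -\mathbf{A}_k^o]$ | - | - | $2 N_r H_e H_a + N_e H_e$ |
| TransE [94] | $-(2h_{ijk}^a - 2h_{ijk}^b + \lVert \mathbf{r}_k \rVert_2^2)$ | $[\mathbf{r}_k; -\mathbf{r}_k]$ | - | $\mathbf{I}$ | $N_r H_e + N_e H_e$ |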


E. Neural tensor networks 

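We can combine traditional MLPs with bilinear models,
resulting in what [92] calls a “neural tensor network” (NTN).
More precisely, we can define the NTN model as follows:

$$f_{ijk}^{\text{NTN}} := \mathbf{w}_k^\top \mathbf{g}([\mathbf{h}_{ijk}^a; \mathbf{h}_{ijk}^b]) \qquad (12)$$
$$\mathbf{h}_{ijk}^a := \mathbf{A}_k^\top [\mathbf{e}_i; \mathbf{e}_j] \qquad (13)$$
$$\mathbf{h}_{ijk}^b := \left[\mathbf{e}_i^\top \mathbf{B}_k^1 \mathbf{e}_j, \ldots, \mathbf{e}_i^\top \mathbf{B}_k^{H_b} \mathbf{e}_j\right] \qquad (14)$$

Here $\mathbf{B}_k$ is a tensor, where the $\ell$-th slice $\mathbf{B}_k^\ell$ has size $H_e \times H_e$,
and there are $H_b$ slices. We call $\mathbf{h}_{ijk}^b$ a bilinear hidden
layer, since it is derived from a weighted combination of
multiplicative terms.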

NTN is a generalization of the RESCAL approach, as we
explain in Section XII-A. Also, it uses the additive layer from
the E-MLP model. However, it has many more parameters
than the E-MLP or RESCAL models. Indeed, the results in
[95] and [28] both show that it tends to overfit, at least on the
(relatively small) datasets used in those papers.
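The NTN score of Equations (12) to (14) combines the additive layer of the E-MLP with one bilinear form per tensor slice. The following sketch uses arbitrary illustrative parameters (with $H_e = 2$, $H_a = 1$, $H_b = 2$) and hypothetical helper names.

```python
import math

def bilinear(e_i, B, e_j):
    """Bilinear form e_i^T B e_j for one slice B of the tensor B_k."""
    return sum(
        B[a][b] * e_i[a] * e_j[b]
        for a in range(len(e_i))
        for b in range(len(e_j))
    )

def ntn_score(e_i, e_j, A_k, B_k, w_k):
    phi = e_i + e_j  # concatenation [e_i; e_j]
    # Additive hidden layer h^a = A_k^T [e_i; e_j] (Equation (13)).
    h_a = [
        sum(A_k[r][c] * phi[r] for r in range(len(phi)))
        for c in range(len(A_k[0]))
    ]
    # Bilinear hidden layer: one entry per slice of B_k (Equation (14)).
    h_b = [bilinear(e_i, B, e_j) for B in B_k]
    # Output layer over the concatenated hidden layers (Equation (12)).
    return sum(w * math.tanh(h) for w, h in zip(w_k, h_a + h_b))

e_i = [1.0, 2.0]
e_j = [0.5, -1.0]
A_k = [[1.0], [0.0], [0.0], [1.0]]      # (2 * H_e) x H_a
B_k = [[[1.0, 0.0], [0.0, 1.0]],        # slice 1: identity
       [[0.0, 1.0], [0.0, 0.0]]]        # slice 2: couples e_i[0] with e_j[1]
w_k = [0.5, 1.0, -1.0]

score = ntn_score(e_i, e_j, A_k, B_k, w_k)
```

Here the hidden layers evaluate to $\mathbf{h}^a = [0.0]$ and $\mathbf{h}^b = [-1.5, -1.0]$. Every additional slice adds $H_e^2$ parameters per relation, which is consistent with the overfitting tendency noted above.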


F. Latent distance models 

Another class of models are latent distance models (also
known as latent space models in social network analysis),
which derive the probability of relationships from the distance
between latent representations of entities: entities are likely
to be in a relationship if their latent representations are close
according to some distance measure. For uni-relational data,
[96] proposed this approach first in the context of social
networks by modeling the probability of a relationship $x_{ij}$
via the score function $f(\mathbf{e}_i, \mathbf{e}_j) = -d(\mathbf{e}_i, \mathbf{e}_j)$, where $d(\cdot, \cdot)$
refers to an arbitrary distance measure such as the Euclidean
distance.


The structured embedding (SE) model [93] extends this idea
to multi-relational data by modeling the score of a triple $x_{ijk}$
as:

$$f_{ijk}^{\text{SE}} := -\lVert \mathbf{A}_k^s \mathbf{e}_i - \mathbf{A}_k^o \mathbf{e}_j \rVert_1 = -\lVert \mathbf{h}_{ijk}^a \rVert_1 \qquad (15)$$

where $\mathbf{A}_k = [\mathbf{A}_k^s; -\mathbf{A}_k^o]$. In Equation (15) the matrices $\mathbf{A}_k^s$,
$\mathbf{A}_k^o$ transform the global latent feature representations of entities
to model relationships specifically for the $k$-th relation. The
transformations are learned using the ranking loss in a way
such that pairs of entities in existing relationships are closer
to each other than entities in non-existing relationships.

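To reduce the number of parameters over the SE model, the
TransE model [94] translates the latent feature representations
via a relation-specific offset instead of transforming them via
matrix multiplications. In particular, the score of a triple $x_{ijk}$
is defined as:

$$f_{ijk}^{\text{TransE}} := -d(\mathbf{e}_i + \mathbf{r}_k, \mathbf{e}_j). \qquad (16)$$

This model is inspired by the results in [91], who showed that
some relationships between words could be computed by their
vector difference in the embedding space. As noted in [95],
under unit-norm constraints on $\mathbf{e}_i$, $\mathbf{e}_j$ and using the squared
Euclidean distance, we can rewrite Equation (16), up to an
additive constant that does not affect the ranking of triples, as follows:

$$f_{ijk}^{\text{TransE}} = -\left(2\mathbf{r}_k^\top(\mathbf{e}_i - \mathbf{e}_j) - 2\mathbf{e}_i^\top \mathbf{e}_j + \lVert \mathbf{r}_k \rVert_2^2\right) \qquad (17)$$

Furthermore, if we assume $\mathbf{A}_k = [\mathbf{r}_k; -\mathbf{r}_k]$, so that
$h_{ijk}^a = [\mathbf{r}_k; -\mathbf{r}_k]^\top [\mathbf{e}_i; \mathbf{e}_j] = \mathbf{r}_k^\top(\mathbf{e}_i - \mathbf{e}_j)$, and $\mathbf{B}_k = \mathbf{I}$, so that
$h_{ijk}^b = \mathbf{e}_i^\top \mathbf{e}_j$, then we can rewrite this model as follows:

$$f_{ijk}^{\text{TransE}} = -\left(2h_{ijk}^a - 2h_{ijk}^b + \lVert \mathbf{r}_k \rVert_2^2\right). \qquad (18)$$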
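The step from Equation (16) to Equation (17) can be verified numerically. Expanding the squared Euclidean distance under unit-norm entity vectors gives the bracketed expression of Equation (17) plus the constant $\lVert \mathbf{e}_i \rVert^2 + \lVert \mathbf{e}_j \rVert^2 = 2$. The sketch below uses made-up unit-norm vectors and hypothetical helper names to show that the two scores differ by exactly this constant.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def transe_score(e_i, r_k, e_j):
    """Equation (16) with d = squared Euclidean distance."""
    translated = [a + b for a, b in zip(e_i, r_k)]
    return -sum((a - b) ** 2 for a, b in zip(translated, e_j))

def transe_score_expanded(e_i, r_k, e_j):
    """Equation (17): the expanded form without the additive constant."""
    diff = [a - b for a, b in zip(e_i, e_j)]
    return -(2 * dot(r_k, diff) - 2 * dot(e_i, e_j) + dot(r_k, r_k))

r_k = [0.3, -0.4]
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),  # both vectors have unit norm
    ([0.6, 0.8], [1.0, 0.0]),
]

# Difference between the expanded and the direct score for each pair.
gaps = [
    transe_score_expanded(e_i, r_k, e_j) - transe_score(e_i, r_k, e_j)
    for e_i, e_j in pairs
]
```

The gap equals 2 for every pair of unit-norm entity vectors, so the two forms induce the same ranking over triples.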


G. Comparison of models 

Table V summarizes the different models we have discussed.
A natural question is: which model is best? [28] showed that
the ER-MLP model outperformed the NTN model on their
particular dataset. [95] performed a more extensive experimental
comparison of these models, and found that RESCAL (called
the bilinear model) worked best on two link prediction tasks.
However, the best model will clearly be dataset dependent.

V. Graph Feature Models 

In this section, we assume that the existence of an edge
can be predicted by extracting features from the observed
edges in the graph. For example, due to social conventions,
parents of a person are often married, so we could predict
the triple (John, marriedTo, Mary) from the existence of the
path $\text{John} \xrightarrow{\text{parentOf}} \text{Anne} \xleftarrow{\text{parentOf}} \text{Mary}$, representing a common
child. In contrast to latent feature models, this kind of
reasoning explains triples directly from the observed triples in
the knowledge graph. We will now discuss some models of
this kind.

A. Similarity measures for uni-relational data 

Observable graph feature models are widely used for link 
prediction in graphs that consist only of a single relation, 
e.g., social network analysis (friendships between people), 
biology (interactions of proteins), and Web mining (hyperlinks 
between Web sites). The intuition behind these methods is that 
similar entities are likely to be related (homophily) and that 
the similarity of entities can be derived from the neighborhood 
of nodes or from the existence of paths between nodes. For 
this purpose, various indices have been proposed to measure 
the similarity of entities, which can be classified into local, 
global, and quasi-local approaches [97].

Local similarity indices such as Common Neighbors, the
Adamic-Adar index [98] or Preferential Attachment [99] derive
the similarity of entities from their number of common neighbors
or their absolute number of neighbors. Local similarity
indices are fast to compute for single relationships and scale
well to large knowledge graphs as their computation depends
only on the direct neighborhood of the involved entities.
However, they can be too localized to capture important
patterns in relational data and cannot model long-range or
global dependencies.

Global similarity indices such as the Katz index [100] and
the Leicht-Holme-Newman index [101] derive the similarity of
entities from the ensemble of all paths between entities, while
indices like Hitting Time, Commute Time, and PageRank [102]
derive the similarity of entities from random walks on the graph.
Global similarity indices often provide significantly better
predictions than local indices, but are also computationally
more expensive [97], [?].

Quasi-local similarity indices like the Local Katz index [?]
or Local Random Walks [?] try to balance predictive accuracy
and computational complexity by deriving the similarity of
entities from paths and random walks of bounded length.
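The local and quasi-local indices above can be sketched on a toy undirected graph. The formulas used are the standard textbook definitions (common neighbors, Adamic-Adar with natural logarithm, preferential attachment, and a Katz-style sum over walks truncated at a maximum length); the graph, the damping factor `beta`, and all function names are illustrative assumptions.

```python
import math

# Toy undirected graph as an adjacency mapping: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

def common_neighbors(graph, u, v):
    return graph[u] & graph[v]

def adamic_adar(graph, u, v):
    # Common neighbors with few links of their own count more.
    return sum(1.0 / math.log(len(graph[z])) for z in common_neighbors(graph, u, v))

def preferential_attachment(graph, u, v):
    return len(graph[u]) * len(graph[v])

def truncated_katz(graph, source, target, beta, max_len):
    """Sum of beta**l times the number of walks of length l <= max_len."""
    counts = {source: 1}  # walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, c in counts.items():
            for nb in graph[node]:
                nxt[nb] = nxt.get(nb, 0) + c
        counts = nxt
        score += beta ** length * counts.get(target, 0)
    return score

cn = common_neighbors(graph, "a", "d")
aa = adamic_adar(graph, "a", "d")
pa = preferential_attachment(graph, "a", "d")
katz = truncated_katz(graph, "a", "d", beta=0.5, max_len=3)
```

The first three indices only inspect the direct neighborhoods of the two nodes, whereas the truncated walk count also uses longer connections, at a cost that grows with the length bound.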

In Section V-C we will discuss an approach that extends this
idea of quasi-local similarity indices for uni-relational networks
to learn from large multi-relational knowledge graphs.


B. Rule Mining and Inductive Logic Programming 

Another class of models that works on the observed variables
of a knowledge graph extracts rules via mining methods and
uses these extracted rules to infer new links. The extracted
rules can also be used as a basis for Markov Logic as
discussed in Section VIII. For instance, ALEPH is an Inductive
Logic Programming (ILP) system that attempts to learn rules
from relational data via inverse entailment [104]. (For more
information on ILP see, e.g., [105], [3], [106].) AMIE is a rule
mining system that extracts logical rules (in particular Horn
clauses) based on their support in a knowledge graph [107], [108].
In contrast to ALEPH, AMIE can handle the open-world
assumption of knowledge graphs and has been shown to be up
to three orders of magnitude faster on large knowledge
graphs [108]. The basis for the Semantic Web is Description
Logic, and [109], [110], [111] describe logic-oriented
machine learning approaches in this context. Also
worth mentioning are data mining approaches for knowledge graphs
as described in [112], [113], [114]. An advantage of rule-based
systems is that they are easily interpretable, as the model is given
as a set of logical rules. However, rules over observed variables
usually cover only a subset of patterns in knowledge graphs (or
relational data), and useful rules can be challenging to learn.

C. Path Ranking Algorithm 

The Path Ranking Algorithm (PRA) [115], [116] extends the
idea of using random walks of bounded lengths for predicting
links in multi-relational knowledge graphs. In particular, let
$\pi_L(i, j, k, t)$ denote a path of length $L$ of the form
$e_i \xrightarrow{r_1} e_2 \xrightarrow{r_2} e_3 \cdots \xrightarrow{r_L} e_j$, where $t$ represents the sequence of edge types
$t = (r_1, r_2, \ldots, r_L)$. We also require there to be a direct arc
$e_i \xrightarrow{r_k} e_j$, representing the existence of a relationship of type $k$
from $e_i$ to $e_j$. Let $\Pi_L(i, j, k)$ represent the set of all such paths
of length $L$, ranging over path types $t$. (We can discover such
paths by enumerating all (type-consistent) paths from entities
of type $e_i$ to entities of type $e_j$. If there are too many relations
to make this feasible, we can perform random sampling.)

We can compute the probability of following such a path
by assuming that at each step, we follow an outgoing link
uniformly at random. Let $P(\pi_L(i, j, k, t))$ be the probability
of this particular path; this can be computed recursively by
a sampling procedure, similar to PageRank (see [116] for
details). The key idea in PRA is to use these path probabilities
as features for predicting the probability of missing edges.
More precisely, define the feature vector

$$\boldsymbol{\phi}_{ijk}^{\text{PRA}} = [P(\pi) : \pi \in \Pi_L(i, j, k)] \qquad (19)$$

We can then predict the edge probabilities using logistic
regression:

$$f_{ijk}^{\text{PRA}} := \mathbf{w}_k^\top \boldsymbol{\phi}_{ijk}^{\text{PRA}} \qquad (20)$$
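The path features of Equation (19) can be sketched for the parentOf example from the beginning of this section. The code below computes path probabilities exactly by propagating a probability distribution along a path type, rather than by the sampling procedure of [116]. It also assumes that each step chooses uniformly among the outgoing edges of the required relation type. The `_inv` suffix for inverse edges, the toy triples, and the weight value are illustrative assumptions.

```python
import math
from collections import defaultdict

def build_index(triples):
    """Map (node, relation) to reachable nodes, including inverse edges."""
    index = defaultdict(list)
    for s, r, o in triples:
        index[(s, r)].append(o)
        index[(o, r + "_inv")].append(s)
    return index

def path_probability(index, start, end, path_type):
    """Probability that a random walk from start following path_type ends at end."""
    dist = {start: 1.0}
    for relation in path_type:
        nxt = defaultdict(float)
        for node, p in dist.items():
            targets = index.get((node, relation), [])
            for t in targets:
                nxt[t] += p / len(targets)  # uniform choice among typed edges
        dist = nxt
    return dist.get(end, 0.0)

def edge_probability(w_k, phi):
    """Logistic regression on the path features: sigmoid of Equation (20)."""
    score = sum(w * x for w, x in zip(w_k, phi))
    return 1.0 / (1.0 + math.exp(-score))

triples = [
    ("John", "parentOf", "Anne"),
    ("Mary", "parentOf", "Anne"),
    ("John", "parentOf", "Bob"),
]
index = build_index(triples)

# Path type for "shares a child with": parentOf followed by its inverse.
p_shared_child = path_probability(index, "John", "Mary", ("parentOf", "parentOf_inv"))

# Hypothetical learned weight for this single path feature.
prob_married = edge_probability([4.0], [p_shared_child])
```

In this toy graph the walk from John reaches Mary with probability 0.25 (it picks Anne with probability 0.5, then Mary with probability 0.5). With several path types, the feature vector simply has one such probability per type.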

Interpretability: A useful property of PRA is that its model is
easily interpretable. In particular, relation paths can be regarded
as bodies of weighted rules (more precisely, Horn clauses),
where the weight specifies how predictive the body of the rule
is for the head. For instance, Table VI shows some relation
paths along with their weights that have been learned by PRA
in the KV project (see Section IX) to predict which college a
person attended, i.e., to predict triples of the form (p, college,
c). The first relation path in Table VI can be interpreted as
follows: it is likely that a person attended a college if the
sports team that drafted the person is from the same college.
This can be written in the form of a Horn clause as follows:

$$(p, \text{college}, c) \leftarrow (p, \text{draftedBy}, t) \wedge (t, \text{school}, c).$$

By using a sparsity promoting prior on $\mathbf{w}_k$, we can perform
feature selection, which is equivalent to rule learning.

TABLE VI
Examples of paths learned by PRA on Freebase to predict which
college a person attended

| Relation Path | F1 | Prec | Rec | Weight |
|---|---|---|---|---|
| (draftedBy, school) | 0.03 | 1.0 | 0.01 | 2.62 |
| (sibling(s), sibling, education, institution) | 0.05 | 0.55 | 0.02 | 1.88 |
| (spouse(s), spouse, education, institution) | 0.06 | 0.41 | 0.02 | 1.87 |
| (parents, education, institution) | 0.04 | 0.29 | 0.02 | 1.37 |
| (children, education, institution) | 0.05 | 0.21 | 0.02 | 1.85 |
| (placeOfBirth, peopleBornHere, education) | 0.13 | 0.1 | 0.38 | 6.4 |
| (type, instance, education, institution) | 0.05 | 0.04 | 0.34 | 1.74 |
| (profession, peopleWithProf, edu., inst.) | 0.04 | 0.03 | 0.33 | 2.19 |

Relational learning results: PRA has been shown to outperform
the ILP method FOIL [106] for link prediction in
NELL [116]. It has also been shown to have comparable
performance to ER-MLP on link prediction in KV: PRA
obtained a result of 0.884 for the area under the ROC curve,
as compared to 0.882 for ER-MLP [28].


VI. Combining latent and graph feature models 


It has been observed experimentally (see, e.g., [28]) that
neither state-of-the-art relational latent feature models (RLFMs)
nor state-of-the-art graph feature models are superior for
learning from knowledge graphs. Instead, the strengths of latent
and graph-based models are often complementary (see, e.g.,
[?]), as both families focus on different aspects of relational
data:

• Latent feature models are well-suited for modeling global
relational patterns via newly introduced latent variables.
They are computationally efficient if triples can be
explained with a small number of latent variables.

• Graph feature models are well-suited for modeling local
and quasi-local graph patterns. They are computationally
efficient if triples can be explained from the neighborhood
of entities or from short paths in the graph.

There has also been some theoretical work comparing these
two approaches [?]. In particular, it has been shown that
tensor factorization can be inefficient when relational data
consists of a large number of strongly connected components.
Fortunately, such “problematic” relations can often be handled
efficiently via graph-based models. A good example is the
marriedTo relation: One marriage corresponds to a single
strongly connected component, so data with a large number of
marriages would be difficult to model with RLFMs. However,
predicting marriedTo links via graph-based models is easy: the
existence of the triple (John, marriedTo, Mary) can be simply
predicted from the existence of (Mary, marriedTo, John), by
exploiting the symmetry of the relation. If the (Mary, marriedTo,
John) edge is unknown, we can use statistical patterns, such
as the existence of shared children.

Combining the strengths of latent and graph-based models
is therefore a promising approach to increase the predictive
performance of graph models. It typically also speeds up the
training. We now discuss some ways of combining these two
kinds of models.

A. Additive relational effects model

[?] proposed the additive relational effects (ARE) model, which
is a way to combine RLFMs with observable graph models.
In particular, if we combine RESCAL with PRA, we get

$$f_{ijk}^{\text{RESCAL+PRA}} = \mathbf{w}_k^{(1)\top} \boldsymbol{\phi}_{ij}^{\text{RESCAL}} + \mathbf{w}_k^{(2)\top} \boldsymbol{\phi}_{ijk}^{\text{PRA}}. \qquad (21)$$

ARE models can be trained by alternately optimizing the
RESCAL parameters and the PRA parameters. The key benefit
is that RESCAL now only has to model the “residual errors” that
cannot be modelled by the observable graph patterns. This
allows the method to use much lower latent dimensionality,
which significantly speeds up training time. The resulting
combined model also has increased accuracy [?].

B. Other combined models

In addition to ARE, further models have been explored to
learn jointly from latent and observable patterns on relational
data. [84], [85] combined a latent feature model with an additive
term to learn from latent and neighborhood-based information
on multi-relational data, as follows:¹¹

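$$f_{ijk}^{\text{ADD}} := \mathbf{w}_{k,j}^{(1)\top} \boldsymbol{\phi}_i^{\text{SUB}} + \mathbf{w}_{k,i}^{(2)\top} \boldsymbol{\phi}_j^{\text{OBJ}} + \mathbf{w}_k^{(3)\top} \boldsymbol{\phi}_{ijk}^{N} \qquad (22)$$
$$\boldsymbol{\phi}_{ijk}^{N} := [\,y_{ijk'} : k' \neq k\,] \qquad (23)$$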


Here, $\boldsymbol{\phi}_i^{\text{SUB}}$ is the latent representation of entity $e_i$ as a subject
and $\boldsymbol{\phi}_j^{\text{OBJ}}$ is the latent representation of entity $e_j$ as an object.
The term $\boldsymbol{\phi}_{ijk}^{N}$ captures patterns efficiently where the existence
of a triple $y_{ijk'}$ is predictive of another triple $y_{ijk}$ between
the same pair of entities (but of a different relation type). For
instance, if Leonard Nimoy was born in Boston, it is also likely
that he lived in Boston. This dependency between the relation
types bornIn and livedIn can be modeled in Equation (23) by
assigning a large weight to $w_{\text{bornIn},\text{livedIn}}$.

ARE and the models of [84] and [85] are similar in
spirit to the model of [?], which augments SVD (i.e.,
matrix factorization) of a rating matrix with additive terms to
include local neighborhood information. Similarly, factorization
machines [120] allow combining latent and observable patterns,
by modeling higher-order interactions between input variables
via low-rank factorizations [?].

An alternative way to combine different prediction systems
is to fit them separately, and use their outputs as inputs to
another “fusion” system. This is called stacking [121]. For
instance, [28] used the output of PRA and ER-MLP as scalar
features, and learned a final “fusion” layer by training a binary
classifier. Stacking has the advantage that it is very flexible
in the kinds of models that can be combined. However, it has
the disadvantage that the individual models cannot cooperate,
and thus any individual model needs to be more complex than
in a combined model which is trained jointly. For example, if
we fit RESCAL separately from PRA, we will need a larger
number of latent features than if we fit them jointly.

¹¹ [85] considered an additional term, giving $f_{ijk}^{\text{ADD}} + \mathbf{w}_k^\top \boldsymbol{\phi}_{ij}^{\text{SUB+OBJ}}$,
where $\boldsymbol{\phi}_{ij}^{\text{SUB+OBJ}}$ is a (non-composite) latent feature representation of
subject-object pairs.

VII. Training SRL models on knowledge graphs 

In this section we discuss aspects of training the previously 
discussed models that are specific to knowledge graphs, such 
as how to handle the open-world assumption of knowledge 
graphs, how to exploit sparsity, and how to perform model 
selection. 

A. Penalized maximum likelihood training 

Let us assume we have a set of $N_d$ observed triples and
let the $n$-th triple be denoted by $x^n$. Each observed triple is
either true (denoted $y^n = 1$) or false (denoted $y^n = 0$). Let this
labeled dataset be denoted by $\mathcal{D} = \{(x^n, y^n) \mid n = 1, \ldots, N_d\}$.
Given this, a natural way to estimate the parameters $\Theta$ is to
compute the maximum a posteriori (MAP) estimate:

$$\max_{\Theta} \sum_{n=1}^{N_d} \log \operatorname{Ber}(y^n \mid \sigma(f(x^n; \Theta))) + \log p(\Theta \mid \lambda) \qquad (24)$$

all-positive data is tricky, because the model might easily
overgeneralize.

One way around this is to make a closed world assumption
and assume that all (type consistent) triples that
are not in $\mathcal{D}^+$ are false. We will denote this negative set as
$\mathcal{D}^- = \{x^n \in \mathcal{D} \mid y^n = 0\}$. However, for incomplete knowledge
graphs this assumption will be violated. Moreover, $\mathcal{D}^-$ might
be very large, since the number of false facts is much larger
than the number of true facts. This can lead to scalability issues
in training methods that have to consider all negative examples.

An alternative approach to generate negative examples is to
exploit known constraints on the structure of a knowledge graph:
Type constraints for predicates (persons are only married to
persons), valid value ranges for attributes (the height of humans
is below 3 meters), or functional constraints such as mutual
exclusion (a person is born exactly in one city) can all be used
for this purpose. Since such examples are based on the violation
of hard constraints, it is certain that they are indeed negative
examples. Unfortunately, functional constraints are scarce and
negative examples based on type constraints and valid value
ranges are usually not sufficient to train useful models: While it
is relatively easy to predict that a person is married to another
person, it is difficult to predict to which person in particular.
For the latter, examples based on type constraints alone are not
very informative. A better way to generate negative examples
is to “perturb” true triples. In particular, let us define


where A controls the strength of the prior. (If the prior is 
uniform, this is equivalent to maximum likelihood training.) 
We can equivalently state this as a regularized loss minimization 
problem: 

N 

min Yi + Areg(0) (25) 

n=l 

where C(p,y) = — log Ber(y|p) is the log loss function. 
Another possible loss function is the squared loss, C(p, y ) = 
(p — y) 2 . Using the squared loss can be especially efficient 
in combination with a closed-world assumption (CWA). For 
instance, using the squared loss and the CWA, the minimization 
problem for RESCAL becomes 

min V ||Y fc - EW fc E T ||f, + X 1 \\E\\ 2 F + X 2 V ||W fc ||^. 
E ’( w *l * k 

(26) 

where Ai,A 2 ^ 0 control the degree of regularization. The 
main advantage of Equation ( |26| ) is that it can be optimized via 
RESCAL-ALS, which consists of a sequence of very efficient, 
closed-form updates whose computational complexity depends 
only on the non-zero entries in Y ||63l 1641. We discuss some 
other loss functions below. 

B. Where do the negative examples come from? 

One important question is where the labels y n come from. 
The problem is that most knowledge graphs only contain 
positive training examples, since, usually, they do not encode 
false facts. Hence y n = 1 for all ( x n , y n ) e V. To emphasize 
this, we shall use the notation V + to represent the observed 
positive (true) triples: T> + = {x n e V \ y n = 1}. Training on 


V = {(ee,r k ,ej) | e* ¥= eg a (e*, r fc , e.,) e V + } 
u {( e u r k ,ei ) | ej A e t a ( ei,r k ,ej ) e £>+} 

To understand the difference between this approach and the
CWA (where we assumed all valid unknown triples were
false), let us consider the example in Figure 1. The CWA
would generate “good” negative triples such as (LeonardNimoy,
starredIn, StarWars), (AlecGuinness, starredIn, StarTrek), etc.,
but also type-consistent but “irrelevant” negative triples such
as (BarackObama, starredIn, StarTrek), etc. (We are assuming,
for the sake of this example, that there is a type Person but not
a type Actor.) The second approach (based on perturbation)
would not generate negative triples such as (BarackObama,
starredIn, StarTrek), since BarackObama does not participate
in any starredIn events. This reduces the size of $\mathcal{D}^-$, and
encourages it to focus on “plausible” negatives. (An even
better method, used in Section IX, is to generate the candidate
triples from text extraction methods run on the Web. Many of
these triples will be false, due to extraction errors, but they
define a good set of “plausible” negatives.)
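
The perturbation idea can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the text: the function name `perturb_negatives` is hypothetical, replacement entities are drawn only from entities already observed in the same argument slot of the same relation (one reading that is consistent with the BarackObama remark above), and corruptions that coincide with known positives are dropped.

```python
import random

def perturb_negatives(positives, num_per_triple=2, seed=0):
    """Corrupt the subject or the object of each true triple (hypothetical helper)."""
    rng = random.Random(seed)
    known = set(positives)
    # Assumption: candidates are entities already seen in that slot of the relation.
    subjects, objects = {}, {}
    for s, r, o in positives:
        subjects.setdefault(r, set()).add(s)
        objects.setdefault(r, set()).add(o)
    negatives = set()
    for s, r, o in positives:
        candidates = [(s2, r, o) for s2 in sorted(subjects[r]) if s2 != s]
        candidates += [(s, r, o2) for o2 in sorted(objects[r]) if o2 != o]
        # Assumption: a corruption that is itself a known positive is discarded.
        candidates = [t for t in candidates if t not in known]
        rng.shuffle(candidates)
        negatives.update(candidates[:num_per_triple])
    return negatives

positives = [
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("AlecGuinness", "starredIn", "StarWars"),
    ("BarackObama", "bornIn", "Honolulu"),
]
negatives = perturb_negatives(positives, num_per_triple=10)
```

In this toy graph the sketch yields the two "good" negatives from the example and never proposes a starredIn triple for BarackObama, because he is not observed as a subject of starredIn.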

Another option to generate negative examples for training is
to make a local-closed world assumption (LCWA) [107], [28],
in which we assume that a KG is only locally complete. More
precisely, if we have observed any triple for a particular
subject-predicate pair $e_i, r_k$, then we will assume that any non-existing
triple $(e_i, r_k, \cdot)$ is indeed false and include them in $\mathcal{D}^-$. (The
assumption is valid for functional relations, such as bornIn,
but not for set-valued relations, such as starredIn.) However,
if we have not observed any triple at all for the pair $e_i, r_k$,
we will assume that all triples $(e_i, r_k, \cdot)$ are unknown and not
include them in $\mathcal{D}^-$.
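
As a rough illustration of the LCWA, the following Python sketch enumerates negatives only for subject-predicate pairs that have at least one observed object. The function name `lcwa_negatives` and the toy entity list are hypothetical, and type consistency is ignored for brevity, so this is a simplification rather than the procedure used in [107] or [28].

```python
def lcwa_negatives(positives, entities):
    """Negatives under a local-closed world assumption (hypothetical helper)."""
    known = set(positives)
    # Subject-predicate pairs for which at least one object has been observed.
    observed_pairs = {(s, r) for s, r, _ in positives}
    negatives = set()
    for s, r in observed_pairs:
        for o in entities:
            if (s, r, o) not in known:
                negatives.add((s, r, o))
    return negatives

entities = ["LeonardNimoy", "AlecGuinness", "Boston", "NewYork"]
positives = [("LeonardNimoy", "bornIn", "Boston")]
negatives = lcwa_negatives(positives, entities)
```

Nothing is generated for (AlecGuinness, bornIn), since no triple was observed for that pair and its objects are therefore treated as unknown.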

C. Pairwise loss training

Given that the negative training examples are not always
really negative, an alternative approach to likelihood training
is to make the probability (or, in general, some scoring
function) larger for true triples than for assumed-to-be-false
triples. That is, we can define the following objective
function:

$$\min_{\Theta} \sum_{x^+ \in \mathcal{D}^+} \sum_{x^- \in \mathcal{D}^-} \mathcal{L}(f(x^+; \Theta), f(x^-; \Theta)) + \lambda\, \mathrm{reg}(\Theta) \tag{27}$$

where $\mathcal{L}(f, f')$ is a margin-based ranking loss function such
as

$$\mathcal{L}(f, f') = \max(1 + f' - f, 0). \tag{28}$$

This approach has several advantages. First, it does not assume
that negative examples are necessarily negative, just that they
are “more negative” than the positive ones. Second, it allows
the $f(\cdot)$ function to be any function, not just a probability (but
we do assume that larger $f$ values mean the triple is more
likely to be correct).

This kind of objective function is easily optimized by
stochastic gradient descent (SGD) [122]: at each iteration,
we just sample one positive and one negative example. SGD
also scales well to large datasets. However, it can take a long
time to converge. On the other hand, as discussed previously,
some models, when combined with the squared loss objective,
can be optimized using alternating least squares (ALS), which
is typically much faster.
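
The pairwise objective and its SGD update can be sketched in a few lines of Python. The sketch makes several assumptions that go beyond the text: the scoring function is a diagonal bilinear score (RESCAL with a diagonal $W_k$, chosen only to keep the gradients short), the regularization term is omitted, and all function names are hypothetical.

```python
import random

def score(E, W, triple):
    """Diagonal bilinear score f(x; Theta) for a triple (s, r, o). Assumed model."""
    s, r, o = triple
    return sum(a * w * b for a, w, b in zip(E[s], W[r], E[o]))

def margin_loss(f_pos, f_neg):
    """Ranking loss of Equation (28): max(1 + f' - f, 0)."""
    return max(1.0 + f_neg - f_pos, 0.0)

def accumulate_grad(E, W, triple, sign, grads):
    """Add sign * d(score)/d(parameters) into grads without modifying E or W."""
    s, r, o = triple
    dim = len(W[r])
    for key in (("E", s), ("E", o), ("W", r)):
        grads.setdefault(key, [0.0] * dim)
    for d in range(dim):
        grads[("E", s)][d] += sign * W[r][d] * E[o][d]
        grads[("E", o)][d] += sign * E[s][d] * W[r][d]
        grads[("W", r)][d] += sign * E[s][d] * E[o][d]

def sgd_step(E, W, pos, neg, lr=0.1):
    """One SGD update on a (positive, negative) pair; returns the loss before the update."""
    loss = margin_loss(score(E, W, pos), score(E, W, neg))
    if loss > 0.0:  # hinge is active: descend on f(neg) - f(pos)
        grads = {}
        accumulate_grad(E, W, neg, +1.0, grads)
        accumulate_grad(E, W, pos, -1.0, grads)
        for (kind, name), g in grads.items():
            params = E[name] if kind == "E" else W[name]
            for d, g_d in enumerate(g):
                params[d] -= lr * g_d
    return loss

def train(E, W, positives, negatives, steps=1000, lr=0.05, seed=0):
    """Sample one positive and one negative example per iteration."""
    rng = random.Random(seed)
    for _ in range(steps):
        sgd_step(E, W, rng.choice(positives), rng.choice(negatives), lr)
```

Because only the sampled pair is touched at each step, the cost per iteration is independent of the size of the negative set, which is the scalability argument made above.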

D. Model selection 

Almost all models discussed in previous sections include
one or more user-given parameters that are influential for the
model’s performance (e.g., dimensionality of latent feature models,
length of relation paths for PRA, regularization parameter
for penalized maximum likelihood training). Typically,
cross-validation over random splits of $\mathcal{D}$ into training, validation,
and test sets is used to find good values for such parameters
without overfitting (for more information on model selection
in machine learning see, e.g., [123]). For link prediction and
entity resolution, the area under the ROC curve (AUC-ROC) or
the area under the precision-recall curve (AUC-PR) are good
evaluation criteria. For data with a large number of negative
examples (as is typically the case for knowledge graphs),
it has been shown that AUC-PR can give a clearer picture of
an algorithm’s performance than AUC-ROC [124]. For entity
resolution, the mean reciprocal rank (MRR) of the correct
entity is an alternative evaluation measure.
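
The two ranking-oriented measures can be sketched in plain Python. Average precision is used here as one common estimator of AUC-PR, which is an assumption: the text does not say how the area is computed. The function names are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision, a common estimator of the area under the PR curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each true positive
    return total / hits if hits else 0.0

def mean_reciprocal_rank(ranked_candidates, correct):
    """MRR of the correct entity over a list of ranked candidate lists."""
    reciprocal_ranks = [
        1.0 / (candidates.index(answer) + 1)
        for candidates, answer in zip(ranked_candidates, correct)
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
mrr = mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"])
```

In the toy data the true triples are ranked first and third, so the average precision is (1/1 + 2/3)/2, and the correct entities are ranked second and first, so the MRR is (1/2 + 1)/2.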

VIII. Markov random fields 

In this section we drop the assumption that the random
variables $y_{ijk}$ in Y are conditionally independent. However,
in the case of relational data and without the conditional
independence assumption, each $y_{ijk}$ can depend on any of
the other $N_e \times N_e \times N_r - 1$ random variables in Y. Due to
this enormous number of possible dependencies, it quickly
becomes intractable to estimate the joint distribution P(Y)
without further constraints, even for very small knowledge
graphs. To reduce the number of potential dependencies and
arrive at tractable models, in this section we develop
template-based graphical models that only consider a small fraction of
all possible dependencies.

(See [125] for an introduction to graphical models.)

A. Representation 

Graphical models use graphs to encode dependencies between
random variables. Each random variable (in our case, a
possible fact $y_{ijk}$) is represented as a node in the graph, while
each dependency between random variables is represented as an
edge. To distinguish such graphs from knowledge graphs, we
will refer to them as dependency graphs. It is important to be
aware of their key difference: while knowledge graphs encode
the existence of facts, dependency graphs encode statistical
dependencies between random variables.

To avoid problems with cyclical dependencies, it is common
to use undirected graphical models, also called Markov Random
Fields (MRFs).¹² An MRF has the following form:

$$P(Y \mid \theta) = \frac{1}{Z} \prod_c \psi(y_c \mid \theta) \tag{29}$$

where $\psi(y_c \mid \theta) \ge 0$ is a potential function on the c-th subset
of variables, in particular the c-th clique in the dependency
graph, and $Z = \sum_{y} \prod_c \psi(y_c \mid \theta)$ is the partition function,
which ensures that the distribution sums to one. The potential
functions capture local correlations between variables in each
clique c in the dependency graph. (Note that in undirected
graphical models, the local potentials do not have any
probabilistic interpretation, unlike in directed graphical models.) This
equation again defines a probability distribution over “possible
worlds”, i.e., over joint assignments to the random
variables Y.

The structure of the dependency graph (which defines the
cliques in Equation (29)) is derived from a template mechanism
that can be defined in a number of ways. A common approach
is to use Markov logic [126], which is a template language
based on logical formulae:

Given a set of formulae $\mathcal{F} = \{F_i\}_{i=1}^{L}$, we create an edge
between nodes in the dependency graph if the corresponding
facts occur in at least one grounded formula. A grounding of
a formula $F_i$ is given by the (type consistent) assignment of
entities to the variables in $F_i$. Furthermore, we define $\psi(y_c \mid \theta)$
such that

$$P(Y \mid \theta) = \frac{1}{Z} \prod_c \exp(\theta_c x_c) \tag{30}$$

where $x_c$ denotes the number of true groundings of $F_c$ in Y,
and $\theta_c$ denotes the weight for formula $F_c$. If $\theta_c > 0$, we prefer
worlds where formula $F_c$ is satisfied; if $\theta_c < 0$, we prefer
worlds where formula $F_c$ is violated. If $\theta_c = 0$, then formula
$F_c$ is ignored.

To explain this further, consider a KG involving two types
of entities, adults and children, and two types of relations,
parentOf and marriedTo. Figure 6a depicts a sample KG with
three adults and one child. Obviously, these relations (edges)
are correlated, since people who share a common child are
often married, while people rarely marry their own children. In
Markov logic, we represent these dependencies using formulae
such as:

$F_1 : (x, \mathit{parentOf}, z) \wedge (y, \mathit{parentOf}, z) \Rightarrow (x, \mathit{marriedTo}, y)$

$F_2 : (x, \mathit{marriedTo}, y) \Rightarrow \neg (y, \mathit{parentOf}, x)$

12 Technically, since we are conditioning on some observed features x, this
is a Conditional Random Field (CRF), but we will ignore this distinction.

Rather than encoding the rule that adults cannot marry their
own children using a formula, we will encode this as a hard
constraint into the type system. Similarly, we only allow adults
to be parents of children. Thus, there are 6 possible facts
in the knowledge graph. To create a dependency graph for
this KG and for this set of logical formulae $\mathcal{F}$, we assign a
binary random variable to each possible fact, represented by a
diamond in Figure 6b, and create edges between these nodes if
the corresponding facts occur in grounded formulae $F_1$ or $F_2$.
For instance, grounding $F_1$ with $x = a_1$, $y = a_3$, and $z = c$
creates the edges $m_{13} \to p_{1c}$, $m_{13} \to p_{3c}$, and $p_{1c} \to p_{3c}$.
The full dependency graph is shown in Figure 6c.

The process of generating the MRF graph by applying 
templated rules to a set of entities is known as grounding 
or instantiation. We note that the topology of the resulting 
graph is quite different from the original KG. In particular, 
we have one node per possible KG edge, and these nodes are 
densely connected. This can cause computational difficulties, 
as we discuss below. 
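
The grounding step and the resulting distribution of Equation (30) can be reproduced by brute force for the toy KG with three adults and one child. The sketch below is illustrative only and makes assumptions not stated in the text: marriedTo is treated as symmetric (one variable per unordered pair of adults), only $F_1$ is grounded because $F_2$ is absorbed into the type system, and the weight value is arbitrary.

```python
import itertools
import math

ADULTS = ["a1", "a2", "a3"]
CHILD = "c"
PAIRS = list(itertools.combinations(ADULTS, 2))
# The 6 possible facts: 3 marriedTo variables and 3 parentOf variables.
FACTS = [("m", x, y) for x, y in PAIRS] + [("p", x, CHILD) for x in ADULTS]

def groundings_f1():
    """Groundings of F1: parentOf(x, c) AND parentOf(y, c) => marriedTo(x, y)."""
    return [(("p", x, CHILD), ("p", y, CHILD), ("m", x, y)) for x, y in PAIRS]

def dependency_edges():
    """Connect all facts that co-occur in at least one grounded formula."""
    edges = set()
    for grounding in groundings_f1():
        for u, v in itertools.combinations(grounding, 2):
            edges.add(frozenset((u, v)))
    return edges

def count_true_groundings(world):
    """x_c in Equation (30): number of groundings whose implication holds."""
    return sum(1 for px, py, m in groundings_f1()
               if not (world[px] and world[py]) or world[m])

def conditional(query, evidence, theta):
    """P(query = 1 | evidence), summing over all 2^6 possible worlds."""
    num = den = 0.0
    for values in itertools.product([0, 1], repeat=len(FACTS)):
        world = dict(zip(FACTS, values))
        if any(world[f] != v for f, v in evidence.items()):
            continue
        weight = math.exp(theta * count_true_groundings(world))
        den += weight
        if world[query] == 1:
            num += weight
    return num / den

m13, p1c, p3c = ("m", "a1", "a3"), ("p", "a1", "c"), ("p", "a3", "c")
both_parents = conditional(m13, {p1c: 1, p3c: 1}, theta=1.5)
one_parent = conditional(m13, {p1c: 1, p3c: 0}, theta=1.5)
```

With a positive weight, observing that both adults are parents of the child raises the probability that they are married, whereas the formula is vacuously satisfied (and hence uninformative) when only one of them is a parent. Enumerating all worlds is only feasible for toy graphs, which is the computational difficulty referred to above.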

B. Inference 

The inference problem consists of estimating the most
probable configuration, $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \theta)$, or the posterior
marginals $p(y_i \mid \theta)$. In general, both of these problems are
computationally intractable [125], so heuristic approximations
must be used.

One approach for computing posterior marginals is to use
Gibbs sampling (see, for example, [31], [127]) or MC-SAT [128].
One approach for computing the MAP estimate is to use the
MPLP (max product linear programming) method [129]. See
[125] for more details.

If one restricts the class of potential functions to be just
disjunctions (using OR and NOT, but no AND), then one
obtains a (special case of) hinge loss MRF (HL-MRFs) [130],
for which efficient convex algorithms can be applied, based
on a continuous relaxation of the binary random variables.
Probabilistic Soft Logic (PSL) [131] provides a convenient
form of “syntactic sugar” for defining HL-MRFs, just as MLNs
provide a form of syntactic sugar for regular (boolean) MRFs.
HL-MRFs have been shown to scale to fairly large knowledge
bases [132].

C. Learning 

The “learning” problem for MRFs deals with specifying the
form of the potential functions (sometimes called “structure
learning”) as well as the values for the numerical parameters
$\theta$. In the case of MRFs for KGs, the potential functions are
often specified in the form of logical rules, as illustrated above.
In this case, structure learning is equivalent to rule learning,
which has been studied in a number of published works (see
Section V-C and [107], [95]).

Fig. 7. Architecture of the Knowledge Vault.

The parameter estimation problem (which is usually cast as
maximum likelihood or MAP estimation), although convex, is
in general quite expensive, since it needs to call inference as
a subroutine. Therefore, various faster approximations, such
as pseudo-likelihood, have been developed (cf. relational
dependency networks [133]).

D. Discussion 

Although approaches based on MRFs are very flexible, it 
is in general harder to make scalable inference and devise 
learning algorithms for this model class, compared to methods 
based on observable or even latent feature models. In this 
article, we have chosen to focus primarily on latent and graph 
feature models because we have more experience with such 
methods in the context of KGs. However, all three kinds of 
approaches to KG modeling are useful. 

IX. Knowledge Vault: relational learning for knowledge base construction

The Knowledge Vault (KV) [28] is a very large-scale
automatically constructed knowledge base, which follows the
Freebase schema (KV uses the 4469 most common predicates).
It is constructed in three steps. In the first step, facts are
extracted from a host of Web sources such as natural language
text, tabular data, page structure, and human annotations (the
extractors are described in detail in [28]). Second, an SRL
model is trained on Freebase to serve as a “prior” for computing
the probability of (new) edges. Finally, the confidence in
the automatically extracted facts is evaluated using both the
extraction scores and the prior SRL model.

The Knowledge Vault uses a combination of latent and
observable models to predict links in a knowledge graph. In
particular, it employs the ER-MLP model (Section IV-D) as a
latent feature model and PRA (Section V-C) as a graph feature
model. In order to combine the two models, KV uses stacking
(Section VI-B). To evaluate the link prediction performance,
these models were applied to a subset of Freebase. The
ER-MLP system achieved an area under the ROC curve
(AUC-ROC) of 0.882, and the PRA approach achieved an almost
identical AUC-ROC of 0.884. The combination of both methods
further increased the AUC-ROC to 0.911. To predict the final
score of a triple, the scores from the combined link-prediction
model are further combined with various features derived from
the extracted triples. These include, for instance, the confidence
of the extractors and the number of (de-duplicated) Web pages
from which the triples were extracted. Figure 7 provides a
high-level overview of the Knowledge Vault architecture.

Fig. 6. (a) A small KG. There are 4 entities (circles): 3 adults ($a_1$, $a_2$, $a_3$) and 1 child $c$. There are 2 types of edges: adults may or may not be married to
each other, as indicated by the red dashed edges, and the adults may or may not be parents of the child, as indicated by the blue dotted edges. (b) We add
binary random variables (represented by diamonds) to each KG edge. (c) We drop the entity nodes, and add edges between the random variables that belong to
the same clique potential, resulting in a standard MRF.

Let us give a qualitative example of the benefits of combining
the prior with the extractors (i.e., the Fusion Layer in Figure 7).
Consider an extracted triple corresponding to the following
relation:¹³

(Barry Richter, attended, University of Wisconsin-Madison).

The extraction confidence for this triple (obtained by fusing
multiple extraction techniques) is just 0.14, since it was based
on the following two rather indirect statements:¹⁴ ¹⁵

In the fall of 1989, Richter accepted a scholarship to
the University of Wisconsin, where he played for four
years and earned numerous individual accolades ...

The Polar Caps’ cause has been helped by the impact
of knowledgable coaches such as Andringa, Byce
and former UW teammates Chris Tancill and Barry
Richter.

However, we know from Freebase that Barry Richter was born
and raised in Madison, Wisconsin. According to the prior
model, people who were born and raised in a particular city
often tend to study in the same city. This increases our prior
belief that Richter went to school there, resulting in a final
fused belief of 0.61.

13 For clarity of presentation we show a simplified triple. Please see [28]
for the actually extracted triples including compound value types (CVT).

14 Source: http://www.legendsofhockey.net/LegendsOfHockey/jsp/SearchPlayer.jsp?player=11377

15 Source: http://host.madison.com/sports/high-school/hockey/numbers-dwindling-for-once-mighty-madison-high-school-hockey-programs/article_95843e00-ec34-11df-9da9-001cc4c002e0.html

Combining the prior model (learned using SRL methods)
with the information extraction model improved performance
significantly, increasing the number of high confidence triples¹⁶
from 100M (based on extractors alone) to 271M (based on
extractors plus prior). The Knowledge Vault is one of the
largest applications of SRL to knowledge base construction to
date. See [28] for further details.

X. Extensions and Future Work 
A. Non-binary relations 

So far we have focused exclusively on binary relations; here we
discuss how relations of other cardinalities can be handled.

Unary relations: Unary relations refer to statements on
properties of entities, e.g., the height of a person. Such
data can naturally be represented by a matrix, in which
rows represent entities, and columns represent attributes. [64]
proposed a joint tensor-matrix factorization approach to learn
simultaneously from binary and unary relations via a shared
latent representation of entities. In this case, we may also need
to modify the likelihood function, so it is Bernoulli for binary
edge variables, and Gaussian (say) for numeric features and
Poisson for count data (see [134]).

16 Triples with the calibrated probability of correctness above 90%.

Higher-arity relations: In knowledge graphs, higher-arity
relations are typically expressed via multiple binary relations.
In Section II, we expressed the ternary relationship
playedCharacterIn(LeonardNimoy, Spock, StarTrek-1) via two
binary relationships (LeonardNimoy, played, Spock) and (Spock,
characterIn, StarTrek-1). However, there are multiple actors
who played Spock in different Star Trek movies, so we
have lost the correspondence between Leonard Nimoy and
StarTrek-1. To model this using binary relations without loss
of information, we can use auxiliary nodes to identify the
respective relationship. For instance, to model the relationship
playedCharacterIn(LeonardNimoy, Spock, StarTrek-1), we can
write

subject          predicate    object
(LeonardNimoy,   actor,       MovieRole-1)
(MovieRole-1,    movie,       StarTrek-1)
(MovieRole-1,    character,   Spock)


where we used the auxiliary entity MovieRole-1 to uniquely 
identify this particular relationship. In most applications 
auxiliary entities get an identifier; if not they are referred to as 
blank nodes. In Freebase auxiliary nodes are called Compound 
Value Types (CVT). 
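
The auxiliary-node encoding can be made concrete with a short Python sketch. The helper names `reify` and `movies_played`, the second role `MovieRole-2`, and the placeholder entities `ActorB` and `StarTrek-2` are all hypothetical and serve only to show that the actor-movie correspondence survives the translation to binary triples.

```python
def reify(aux_node, actor, movie, character):
    """Encode one ternary playedCharacterIn fact as three binary triples."""
    return [
        (actor, "actor", aux_node),
        (aux_node, "movie", movie),
        (aux_node, "character", character),
    ]

def movies_played(triples, actor, character):
    """Recover the movies in which `actor` played `character` via auxiliary nodes."""
    facts = set(triples)
    roles = {o for s, p, o in facts if s == actor and p == "actor"}
    return {
        o for s, p, o in facts
        if p == "movie" and s in roles and (s, "character", character) in facts
    }

kg = reify("MovieRole-1", "LeonardNimoy", "StarTrek-1", "Spock")
# Hypothetical second actor playing the same character in another movie.
kg += reify("MovieRole-2", "ActorB", "StarTrek-2", "Spock")
nimoy_movies = movies_played(kg, "LeonardNimoy", "Spock")
```

With the two-triple encoding (played, characterIn) both movies would be reachable from LeonardNimoy through Spock; with the auxiliary node, the query returns only StarTrek-1.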

Since higher-arity relations involving time and location
are relatively common, the YAGO2 project extended the
SPO triple format to the (subject, predicate, object, time,
location) (SPOTL) format to model temporal and spatial
information about relationships explicitly, without transforming
them to binary relations. Furthermore, there has also been
work on extracting higher-arity relations directly from natural
language [135].

A related issue is that the truth-value of a fact can change 
over time. For example, Google’s current CEO is Larry Page, 
but from 2001 to 2011 it was Eric Schmidt. Both facts are 
correct, but only during the specified time interval. For this 
reason, Freebase allows some facts to be annotated with 
beginning and end dates, using CVT (compound value type) 
constructs, which represent n-ary relations via auxiliary nodes. 
In the future, it is planned to extend the KV system to model 
such temporal facts. However, this is non-trivial, since it is not 
always easy to infer the duration of a fact from text, since it is 
not necessarily related to the timestamp of the corresponding
source (cf. [136]).

As an alternative to the usage of auxiliary nodes, a set of
n-ary relations can be represented by a single (n+1)-th-order
tensor. RESCAL can easily be generalized to higher-arity
relations and can be solved by higher-order tensor factorization
or by neural network models with the corresponding number
of entity representations as inputs [134].


B. Hard constraints: types, functional constraints, and others 

Imposing hard constraints on the allowed triples in knowledge
graphs can be useful. Powerful ontology languages such as
the Web Ontology Language (OWL) [137] have been developed,
in which complex constraints can be formulated. However,
reasoning with ontologies is computationally demanding, and
hard constraints are often violated in real-world data [138], [139].
Fortunately, machine learning methods can be robust in the
face of contradictory evidence.

Deterministic dependencies: Triples in relations such as
subClassOf and isLocatedIn follow clear deterministic
dependencies such as transitivity. For example, if Leonard Nimoy
was born in Boston, we can conclude that he was born
in Massachusetts, that he was born in the USA, that he
was born in North America, etc. One way to consider such
ontological constraints is to precompute all true triples that
can be derived from the constraints and to add them to
the knowledge graph prior to learning. The precomputation
of triples according to ontological constraints is also called
materialization. However, on large knowledge graphs, full
materialization can be computationally demanding.
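
A minimal Python sketch of materialization for the Leonard Nimoy example follows. It assumes two rules that are implied but not spelled out in the text: isLocatedIn is transitive, and bornIn(x, y) together with isLocatedIn(y, z) entails bornIn(x, z). The function names are hypothetical, and the naive fixed-point loop is only suitable for small graphs.

```python
def transitive_closure(pairs):
    """Naive fixed-point computation of the transitive closure of a relation."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new

def materialize(triples):
    """Add all triples derivable from the two assumed rules."""
    facts = set(triples)
    located = transitive_closure(
        {(s, o) for s, r, o in facts if r == "isLocatedIn"}
    )
    facts |= {(s, "isLocatedIn", o) for s, o in located}
    # Assumed rule: bornIn(x, y) and isLocatedIn(y, z) imply bornIn(x, z).
    facts |= {
        (s, "bornIn", z)
        for s, r, o in triples if r == "bornIn"
        for y, z in located if y == o
    }
    return facts

kg = [
    ("LeonardNimoy", "bornIn", "Boston"),
    ("Boston", "isLocatedIn", "Massachusetts"),
    ("Massachusetts", "isLocatedIn", "USA"),
    ("USA", "isLocatedIn", "NorthAmerica"),
]
materialized = materialize(kg)
```

Even in this four-triple example the materialized graph has more than twice as many triples as the input, which hints at why full materialization becomes demanding on large graphs.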

Type constraints: Often relations only make sense when
applied to entities of the right type. For example, the domain
and the range of marriedTo is limited to entities which are
persons. Modelling type constraints explicitly requires complex
manual work. An alternative is to learn approximate type
constraints by simply considering the observed types of subjects
and objects in a relation. The standard RESCAL model has
been extended (see, e.g., [69]) to handle type constraints of
relations efficiently. As a result, the rank required for a good
RESCAL model can be greatly reduced. Furthermore, [85]
considered learning latent representations for the argument
slots in a relation to learn the correct types from data.
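
Learning approximate type constraints from the observed subjects and objects of a relation can be sketched as follows. This is a simplified illustration, not the method of the cited extensions: the function names, the `entity_types` dictionary, and the placeholder entities are hypothetical.

```python
def learn_type_constraints(triples, entity_types):
    """Collect the types observed in the subject and object slot of each relation."""
    domain, range_ = {}, {}
    for s, r, o in triples:
        domain.setdefault(r, set()).update(entity_types[s])
        range_.setdefault(r, set()).update(entity_types[o])
    return domain, range_

def is_type_consistent(triple, entity_types, domain, range_):
    """A triple is admissible if its arguments share a type with the observed slots."""
    s, r, o = triple
    return (bool(entity_types[s] & domain.get(r, set()))
            and bool(entity_types[o] & range_.get(r, set())))

entity_types = {
    "PersonA": {"Person"},
    "PersonB": {"Person"},
    "CityX": {"City"},
}
observed = [("PersonA", "marriedTo", "PersonB")]
domain, range_ = learn_type_constraints(observed, entity_types)
```

Candidate triples that fail `is_type_consistent` can then be excluded from training and prediction, which shrinks the space a latent feature model has to explain.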

Functional constraints and mutual exclusiveness: Although
the methods discussed in Sections IV and V can model
long-range and global dependencies between triples, they do not
explicitly enforce functional constraints that induce mutual
exclusivity between possible values. For instance, a person
is born in exactly one city. If one of these values
is observed, then observable graph models can prevent other
values from being asserted, but if all the values are unknown,
the resulting mutual exclusion constraint can be hard to deal
with computationally.


C. Generalizing to new entities and relations 

In addition to missing facts, there are many entities that are 
mentioned on the Web but are currently missing in knowledge 
graphs like Freebase and YAGO. If new entities or predicates 
are added to a KG, one might want to avoid retraining the 
model due to runtime considerations. Given the current model 
and a set of newly observed relationships, latent representations 
of new entities can be calculated approximately in both 
tensor factorization models and in neural networks, by finding 
representations that explain the newly observed relationships 
relative to the current model. Similarly, it has been shown that
the relation-specific weights $W_k$ in the RESCAL model can
be calculated efficiently for new relation types given already
derived latent representations of entities [140].


D. Querying probabilistic knowledge graphs 

RESCAL and KV can be viewed as probabilistic databases
(see, e.g., [141], [142]). In the Knowledge Vault, only the
probabilities of triples are queried. Some applications might
require more complex queries such as: Who is born in Rome
and likes someone who is a child of Albert Einstein? It is known
that queries involving joins (existentially quantified variables)
are expensive to calculate in probabilistic databases [141].
In [140], it was shown how some queries involving joins can
be efficiently handled within the RESCAL framework.


E. Trustworthiness of knowledge graphs 

Automatically constructed knowledge bases are only as good
as the sources from which the facts are extracted. Prior studies
in the field of data fusion have developed numerous approaches
for modelling the correctness of information supplied by
multiple sources in the presence of possible data conflicts (see
[143], [144] for recent surveys). However, the key assumption in
data fusion, namely that the facts provided by the sources are
indeed stated by them, is often violated when the information
is extracted automatically. If a given source contains a mistake,
it could be because the source actually contains a false fact, or
because the fact has been extracted incorrectly. A recent study
has formulated the problem of knowledge fusion, where
the above assumption is no longer made, and the correctness
of information extractors is modeled explicitly. A follow-up
study by the authors [146] developed several approaches for
solving the knowledge fusion problem, and applied them to
estimate the trustworthiness of facts in the Knowledge Vault
(cf. Section IX).

XI. Concluding Remarks 

Knowledge graphs have found important applications in 
question answering, structured search, exploratory search, and 
digital assistants. We provided a review of state-of-the-art 
statistical relational learning (SRL) methods applied to very 
large knowledge graphs. We also demonstrated how statistical 
relational learning can be used in conjunction with machine 
reading and information extraction methods to automatically 
build such knowledge repositories. As a result, we showed 
how to create a truly massive, machine-interpretable “semantic 
memory” of facts, which is already empowering numerous 
practical applications. However, although these KGs are 
impressive in their size, they still fall short of representing 
many kinds of knowledge that humans possess. Notably missing 
are representations of “common sense” facts (such as the fact 
that water is wet, and wet things can be slippery), as well 
as “procedural” or how-to knowledge (such as how to drive 
a car or how to send an email). Representing, learning, and 
reasoning with these kinds of knowledge remains the next 
frontier for AI and machine learning. 

XII. Appendix 

A. RESCAL is a special case of NTN 

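Here we show how the RESCAL model of Section IV-A is a
special case of the neural tensor model (NTN) of Section IV-E.
To see this, note that RESCAL has the form

$$f_{ijk}^{\mathrm{RESCAL}} = \mathbf{e}_i^{\top} \mathbf{W}_k \mathbf{e}_j = \mathbf{w}_k^{\top} [\mathbf{e}_j \otimes \mathbf{e}_i] \tag{31}$$

Next, note that

$$\mathbf{v} \otimes \mathbf{u} = \mathrm{vec}(\mathbf{u}\mathbf{v}^{\top}) = [\mathbf{u}^{\top} \mathbf{B}^{1} \mathbf{v}, \ldots, \mathbf{u}^{\top} \mathbf{B}^{n} \mathbf{v}]$$

where $n = |\mathbf{u}||\mathbf{v}|$, and $\mathbf{B}^{k}$ is a matrix of all 0s except for a
single 1 element in the k-th position, which “plucks out” the
corresponding entries from the vectors $\mathbf{u}$ and $\mathbf{v}$.

In general, define $\delta_{ij}$ as a matrix of all 0s except for entry
(i,j), which is 1. Then if we define $\mathbf{B}_k = [\delta_{1,1}, \ldots, \delta_{H_e,H_e}]$
we have

$$\mathbf{h}^{b}_{ijk} = [\mathbf{e}_i^{\top} \mathbf{B}_k^{1} \mathbf{e}_j, \ldots, \mathbf{e}_i^{\top} \mathbf{B}_k^{H_b} \mathbf{e}_j] = \mathbf{e}_j \otimes \mathbf{e}_i$$

Finally, if we define $\mathbf{A}_k$ as the empty matrix (so $\mathbf{h}^{a}_{ijk}$ is
undefined), and $g(u) = u$ as the identity function, then the
NTN equation

$$f_{ijk}^{\mathrm{NTN}} = \mathbf{w}_k^{\top} g([\mathbf{h}^{a}_{ijk}; \mathbf{h}^{b}_{ijk}])$$

matches Equation (31).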
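
The identity behind Equation (31), namely that the bilinear form equals an inner product between the column-major vectorization of $W_k$ and the Kronecker product $e_j \otimes e_i$, can be checked numerically. The sketch below uses plain Python lists and arbitrary small integer values; the helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two vectors given as lists."""
    return [x * y for x in a for y in b]

def vec(M):
    """Column-major vectorization of a matrix given as a list of rows."""
    rows, cols = len(M), len(M[0])
    return [M[i][j] for j in range(cols) for i in range(rows)]

def bilinear(e_i, W, e_j):
    """The RESCAL score e_i^T W e_j."""
    return sum(e_i[a] * W[a][b] * e_j[b]
               for a in range(len(e_i)) for b in range(len(e_j)))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

e_i, e_j = [1, 2], [3, 4]
W = [[1, 2], [3, 4]]
lhs = bilinear(e_i, W, e_j)
rhs = dot(vec(W), kron(e_j, e_i))      # w_k^T [e_j (x) e_i]
swapped = dot(vec(W), kron(e_i, e_j))  # wrong argument order
```

The two sides agree only when the Kronecker product is taken as $e_j \otimes e_i$; swapping the arguments pairs each weight with the wrong product of entries unless $W_k$ is symmetric.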